eQTLGen phase II cookbook

This cookbook is for running genome-wide eQTL mapping in a consortium setting using HASE (Roshchupkin et al., 2016; original code repo). The HASE method enables genome-wide high-dimensional association analyses in consortium settings by limiting the size of the shared data while preserving participant privacy.

In this cookbook we walk through the steps required for the eQTL analysis: data QC, removal of related samples, genotype phasing and imputation, conversion of genotype data into the required format, preparation of covariates, conversion of expression and covariate data into the required format, data encoding, calculation of partial derivatives, preparation of files for permuted data, and organisation of the non-personal encoded data and partial derivatives for sharing with the central site. We also include the pipelines that will be used for running the meta-analyses at the central site.

For eQTLGen phase II we perform this analysis for whole blood and PBMC, in individual datasets with at least 100 samples.

Why this method?

Classical meta-analysis requires each cohort to perform a full GWAS for every gene (~20,000 genes). Full summary statistics are large, meaning that every cohort would need to share several terabytes of data with the central site. Sharing that much data is technically very complicated.

In contrast, the HASE method calculates matrices of aggregate statistics (called partial derivatives) which are needed for running predefined models, as well as encoded genotype and expression matrices. All of these matrices are stored in an efficient file format and are relatively small, making them feasible to share. Encoding means multiplying the original data with a random matrix F (or its inverse, in the case of the expression matrix), so that no person-level information can be recovered from the genotype and expression matrices after encoding. Sharing the partial derivatives and encoded matrices with the central site nevertheless enables running pre-defined association tests between every variant and every gene for every dataset, and finally running the meta-analysis across datasets. An additional merit of HASE is that covariates can be dropped adaptively in the encoded association test (by slicing the aggregate partial derivative matrices), enabling informed decisions on the most suitable set of covariates to include in the final meta-analysis. The method also shifts most of the computational burden to the central site, so individual cohorts do not need to run ~20,000 GWASes on their respective HPCs.
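
As a rough sketch of why encoding preserves the association results (generic notation, not necessarily that of the HASE paper): if the genotype matrix G is encoded by multiplication with a random non-singular matrix F, and the expression matrix Y with the transpose of F's inverse, the genotype-expression cross-products needed for the regression coefficients are unchanged:

$$(\mathbf{F}\mathbf{G})^{\top}\,(\mathbf{F}^{-\top}\mathbf{Y}) \;=\; \mathbf{G}^{\top}\mathbf{F}^{\top}\mathbf{F}^{-\top}\mathbf{Y} \;=\; \mathbf{G}^{\top}\mathbf{Y}$$

Roughly speaking, the remaining terms of the regression come from the shared partial derivative matrices, so the central site never needs the individual-level data.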

You can see the specifics of the method here: Roshchupkin et al., 2016

Technical support

In case of any issues when running this cookbook please contact Urmo Võsa (urmo.vosa at gmail.com) and Robert Warmerdam (c.a.warmerdam at umcg.nl).

Prerequisites

Needed

  1. HPC with multiple cores, UNIX/Linux and a scheduling system.
  2. You need to have Bash >=3.2 installed on your HPC.
  3. You need to have Java >=11 installed on your HPC. This is needed for running the Nextflow pipeline management tool.
  4. Our pipelines expect that you have Singularity configured and running on your HPC for managing the needed tools/dependencies. This means that you don’t need to install many programs on your HPC. If there is no way of using Singularity on your HPC, please contact us and we will look into alternative ways of dependency management.
  5. You might also need a few extra modules so that Singularity runs as expected. If this is the case, we recommend checking your HPC documentation and/or contacting tech support. E.g. in the University of Tartu HPC, a module named squashfs is also needed for Singularity to work correctly.
  6. Your HPC should have access to the internet. This is needed for downloading the analysis pipelines from code repos, for automatic download of the containers from container repos, and for downloading some reference files. If it is mandatory for you to work offline, please use these instructions for getting the pipelines working offline. Please contact the lead analysts in case additional help is needed.
  7. git. This is convenient for cloning the analysis pipelines from the code repos. However, you can also download those manually.
  8. The current analysis pipelines and configurations in this cookbook are tailored for the Slurm scheduler. We also provide profiles for two other commonly used schedulers (PBS/TORQUE and SGE; we have no possibility to test these, but they should work). It is also straightforward to add configurations for other commonly used schedulers, so if your HPC uses a different scheduler, please contact us and we will try to make a configuration file for that scheduler.

Setup

The analysis consists of four Nextflow pipelines which are used to perform all the needed tasks in the correct order. For each pipeline, we provide a Slurm script template which should be complemented/adjusted with your input paths and a few dataset-specific settings. Those pipelines are:

  1. DataQC: genotype and gene expression quality control.
  2. eQTLGenImpute: genotype phasing and imputation.
  3. ConvertVcf2Hdf5: conversion of the imputed genotypes into hdf5 format.
  4. PerCohortDataPreparations: data encoding, calculation of partial derivatives, and organisation of the files to share with the central site.

We will provide detailed instructions for each pipeline below; however, for the most optimal setup, we recommend that you first organize your analysis folder as follows:
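
A minimal sketch of this layout, reconstructed from the relative paths that the script templates below assume (treat it as a suggestion; the pipeline repositories are cloned into the numbered subfolders during the steps that follow, and the Nextflow installer command is the standard one from nextflow.io):

# Create the analysis folder structure assumed by the pre-filled template paths
mkdir -p eQTLGen_phase2/{1_DataQC,2_Imputation,3_ConvertVcf2Hdf5,4_PerCohortPreparations}
mkdir -p eQTLGen_phase2/{tools,singularitycache,nextflowcache}

# Place the Nextflow executable into the tools folder
cd eQTLGen_phase2/tools
curl -s https://get.nextflow.io | bash
cd ../..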

❗ If downloading or running Nextflow does not work, it might be that your HPC has restrictions on connecting to the internet. If this is the case, please use these instructions for trying to install Nextflow offline. Please contact the lead analysts in case additional help is needed.

❗ To ease the process, we have pre-filled some paths in the templates, assuming that you use the recommended folder structure.

Offline use

In some HPCs internet access is restricted, which causes error messages when installing and running Nextflow. This is because, by default, the Nextflow executable tries to contact the internet to check whether a newer version of Nextflow is available. Also, by default, the first (Data QC) pipeline tries to automatically download the plink executables and the 1000G reference panel. To get the setup working offline, please follow these instructions. Please consult the lead analysts in case the abovementioned instructions do not solve the issues.

Running, monitoring and debugging Nextflow pipelines

The example commands here are based on the Slurm scheduler; however, they can easily be adapted to other schedulers.

Hopefully those three files give you some idea of what might have gone wrong. Contact us if you need any help to resolve the issue.
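
As generic examples (assuming a Slurm setup and the standard Nextflow log locations; job names and hashes below are placeholders), the following commands are usually enough to monitor a run and find out where it failed:

# See whether your master job and the tasks submitted by Nextflow are still running
squeue -u $USER

# Follow the main Nextflow log, written to the folder from which the pipeline was started
tail -f .nextflow.log

# Check the Slurm log of the master job (exact file name depends on your submission)
less slurm-<jobid>.out

# For a failed task, Nextflow prints a work directory (work/xx/xxxxxx...);
# its .command.sh, .command.log and .command.err files show what was run and what went wrong
cat work/<hash>/.command.err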

Data

For this analysis plan you need the following information and data:

Name of your cohort

An informative name for your dataset is one of the required inputs for the majority of the pipelines in this cookbook. This could be the combination of cohort name and expression platform. If you contribute multiple datasets, e.g. from several batches, you can indicate this by adding a suffix such as _batch1. If you contribute multiple datasets from several ancestries, you can indicate this by adding the ancestry, e.g. _EUR.

Example 1: For EstBB cohort I have 2 datasets EstBB_HT12v3 and EstBB_RNAseq.

Example 2: If there were two EstBB RNAseq datasets, I would use EstBB_RNAseq_batch1 and EstBB_RNAseq_batch2.

Genotype data

❗ Standard pre-imputation genotype and sample quality control will be done by our pipeline, so no need for thorough QC.

Gene expression data

❗ For RNA-seq data we expect that the raw RNA-seq data has already been appropriately processed and QC’d prior to generating the count matrix (e.g. proper settings for the read mapping, etc.).

❗ For Affymetrix arrays we expect that all the standard data preprocessing steps for this array type have already been applied (e.g. RMA normalization). On top of those, our pipeline only applies an inverse normal transformation to each gene (forcing the expression distribution of each gene to be normal).

❗ For Affymetrix arrays we currently use the same probe-to-gene linking as in eQTLGen phase 1. It is possible that you have some other format for probe IDs: if this is the case, please contact us and we will update the pipeline accordingly.

Genotype-to-expression (GTE) linking file

If you participated in eQTLGen phase 1 analyses, this file should already be available.

Additional covariates

As standard, the DataQC step automatically constructs a number of default covariates from both the genotype and the expression data: we calculate 10 genotype PCs and 100 expression PCs from the data, and the genotype-inferred sex is used as an additional covariate. Optionally, an additional set of covariates can be supplied if you know that these are relevant for your dataset. For that, please make a file with the additional cohort-specific covariates:
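
A hypothetical example of such a file (we assume a tab-delimited text file; the covariate names below are made up for illustration, but the first column should be named SampleID and contain your sample IDs):

SampleID    RNA_batch    Smoking_status
Sample1     1            0
Sample2     1            1
Sample3     2            0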

Mixups

We assume that potential sample mixups have already been identified, and resolved or removed, in previous research. See, e.g., details here.

Analysis instructions

1. Data QC

In this step, we prepare the genotype and gene expression data for the next steps. Specifically, the pipeline does the following steps:

Input files
Genotype data

❗ Because pre-phasing of genotypes benefits from larger sample sizes and many eQTL datasets have modest sample sizes (N<1,000), it is not advisable to prefilter the unimputed genotype data to include only the eQTL samples. Our pipeline extracts the eQTL samples itself (based on the genotype-to-expression file) and, if additional samples are available, includes up to 5,000 additional genotype samples, which are kept until the pre-phasing step of the analysis. If genotype data is available for fewer than 5,000 samples, the pipeline uses only the samples that are available.

Expression data

❗ For Affymetrix arrays: if you did not participate in eQTLGen phase 1, it might be that your probe name format does not align with our pipeline. If so, please contact urmo.vosa at gmail.com with some information on how the probe naming and annotation was done in your data, and we will update the pipeline accordingly.

Additional files
Instructions
  1. Go into eQTLGen_phase2/1_DataQC.
  2. Clone the DataQC pipeline git clone https://github.com/eQTLGen/DataQC.git.
  3. Make folder output.
  4. Go into pipeline folder eQTLGen_phase2/1_DataQC/DataQC and adjust the script template with needed modules, input paths and settings. Aside from the paths to your data files, you must specify:
  5. Expression platform of your expression data (--exp_platform). Options: Illumina arrays: HT12v3, HT12v4 or HuRef8; RNA-seq: RNAseq; Affymetrix arrays: AffyU219 or AffyHumanExon.
  6. Informative name for your dataset (--cohort_name): your unique dataset name used by all pipelines.
  7. The genome build of your dataset (--genome_build): hg19, GRCh37, hg38 or GRCh38 (GRCh37 prefilled).
  8. If your dataset is in VCF format, replace the --bfile argument with the --vcf argument. The --vcf argument expects a full path to a VCF dataset. Globbing characters (* or ?) are allowed in the path; note that the template disables shell pathname expansion (set -f) so that Nextflow can handle the expansion itself.
  9. You can select the scheduler type by adjusting the profile as follows:

    • Slurm: -profile slurm,singularity
    • PBS/TORQUE: -profile pbs,singularity
    • SGE: -profile sge,singularity

  10. If your cluster setup does not allow an internet connection, you can specify paths to local plink executables and reference files (arguments: --plink_executable, --plink2_executable and --reference_1000g_folder). Contact the lead analysts for further guidance, if needed.

This is the template for Slurm scheduler (submit_DataQC_pipeline_template.sh):

#!/bin/bash

#SBATCH --time=48:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=6G
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --job-name="DataQc"

# These are needed modules in UT HPC to get singularity and Nextflow running. Replace with appropriate ones for your HPC.
module load java/11.0.2
module load singularity/3.5.3
module load squashfs/4.4

# If you follow the eQTLGen phase II cookbook and analysis folder structure,
# some of the following paths are pre-filled.
# https://github.com/eQTLGen/eQTLGen-phase-2-cookbook/wiki/eQTLGen-phase-II-cookbook

# We set the following variables for nextflow to prevent writing to your home directory (and potentially filling it completely)
# Feel free to change these as you wish.
export SINGULARITY_CACHEDIR=../../singularitycache
export NXF_HOME=../../nextflowcache

# Disable pathname expansion. Nextflow handles pathname expansion by itself.
set -f

# Define paths
nextflow_path=../../tools # folder where Nextflow executable is

# Genotype data
bfile_path=[full path to your input genotype files without .bed/.bim/.fam extension]

# Other data
exp_path=[full path to your gene expression matrix]
gte_path=[full path to your genotype-to-expression file]
exp_platform=[expression platform name: HT12v3/HT12v4/HuRef8/RNAseq/AffyU219/AffyHumanExon]
cohort_name=[name of the cohort]
genome_build="GRCh37"
output_path=../output # Output path

# Additional settings and optional arguments for the command

# --GenOutThresh [numeric threshold]
# --GenSdThresh [numeric threshold]
# --ExpSdThresh [numeric threshold]
# --ContaminationSlope [number between 0 and 90, default 45]
# --ContaminationArea [number between 0 and 90, default 30]
# --InclusionList [file with the list of samples to restrict the analysis]
# --ExclusionList [file with the list of samples to remove from the analysis]
# --preselected_sex_check_vars "data/Affy6_pruned_chrX_variant_positions.txt"
# --AdditionalCovariates [file with additional covariates. First column should be `SampleID`]
# --gen_qc_steps 'WGS'
# --fam [PLINK .fam file. Takes precedence over .fam file supplied with `--bfile`]
# --plink_executable [path to plink executable (PLINK v1.90b6.26 64-bit)]
# --plink2_executable [path to plink2 executable (PLINK v2.00a3.7LM 64-bit Intel)]
# --reference_1000g_folder [path to folder with 1000G reference data]

# Command:
NXF_VER=21.10.6 ${nextflow_path}/nextflow run DataQC.nf \
--bfile ${bfile_path} \
--expfile ${exp_path} \
--gte ${gte_path} \
--exp_platform ${exp_platform} \
--cohort_name ${cohort_name} \
--genome_build ${genome_build} \
--outdir ${output_path}  \
-profile slurm,singularity \
-resume
  1. Submit the job.

  2. Go into the output folder, download and investigate the detailed report file Report_DataQc_[your cohort name].html. In this report you will find diagnostic plots and instructions on how to proceed.

  3. According to the instructions in the QC report, it is likely that you need to adjust some of the QC thresholds. In your Slurm script, you can add some or all of the arguments --GenOutThresh, --GenSdThresh, --ExpSdThresh, --ContaminationSlope and --ContaminationArea, and specify the appropriate values for each of those (see the example command after this list).

  4. In WGS datasets, many samples might not pass the X-chromosome heterozygosity thresholds (F<0.2 and F>0.8). To resolve this, you can add the argument --preselected_sex_check_vars ${sex_check_vars_path}, which makes the method use a preselected set of variants chosen based on array data.

  5. In some cases, you might also need to split the data into several batches according to ancestry. In that case, run this and the following pipelines for each batch separately.

  6. Resubmit the job.

  7. Go into the output folder, download and investigate the report file Report_DataQc_[your cohort name].html. Now QC plots should look as expected.

  8. If some of the plots still indicate issues, adjust the arguments and re-run the pipeline. You might need to do so a couple of times until there are no more apparent issues.

  9. 🎉 Done!
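
For illustration, after adjusting the thresholds the command in the Slurm script could look like the sketch below; the threshold values shown are placeholders, not recommendations, so take the values suggested by your own QC report:

# DataQC command with added QC threshold arguments (placeholder values)
NXF_VER=21.10.6 ${nextflow_path}/nextflow run DataQC.nf \
--bfile ${bfile_path} \
--expfile ${exp_path} \
--gte ${gte_path} \
--exp_platform ${exp_platform} \
--cohort_name ${cohort_name} \
--genome_build ${genome_build} \
--outdir ${output_path} \
--GenSdThresh 4 \
--ExpSdThresh 4 \
--ContaminationSlope 45 \
--ContaminationArea 30 \
-profile slurm,singularity \
-resume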

Output

The pipeline writes its output into the specified directory. The important files are the following.

  1. The files output/outputfolder_gen/gen_data_QCd/*_ToImputation.bed, output/outputfolder_gen/gen_data_QCd/*_ToImputation.bim and output/outputfolder_gen/gen_data_QCd/*_ToImputation.fam are the filtered and QC'd genotype files which are the input for the next (imputation) pipeline. You need to specify the path to the base name (without the .bed/.bim/.fam extension) as the input.
  2. The whole output folder should be specified as one of the inputs for the per-cohort preparations pipeline. That pipeline automatically uses the processed, QC'd expression data and covariate file to run the data encoding and partial derivative calculation, organises the encoded matrices for sharing with the central site, and extracts some QC files, plots and summaries for sharing with the central site. An example layout of the output folder is shown below.
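
A rough sketch of the relevant parts of the Data QC output folder (only the files and folders referenced above are shown):

output/
  Report_DataQc_[your cohort name].html
  outputfolder_gen/
    gen_data_QCd/
      *_ToImputation.bed
      *_ToImputation.bim
      *_ToImputation.fam
  ...   (processed expression data, covariates and QC summaries, picked up automatically by the per-cohort preparations pipeline)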

❗ While you continue with the next step, it is also a good time to send Report_DataQc_[your cohort name].html to Urmo Võsa (urmo.vosa at gmail.com) and Robert Warmerdam (c.a.warmerdam at umcg.nl). This way we can have an initial look at the quality of the data and let you know if something seems suboptimal.

2. Genotype imputation

In this step we impute the QC’d genotype data to the recently released 1000G 30X WGS reference panel.

Specifically, this pipeline performs the following steps:

Input files
Instructions
  1. Go into eQTLGen_phase2/2_Imputation.

  2. Clone the eQTLGenImpute pipeline git clone https://github.com/eQTLGen/eQTLGenImpute.git.

  3. Make folder output.

  4. Download the zipped reference wget https://www.dropbox.com/s/6g58ygjg9d2fvbi/eQTLGenReferenceFiles.tar.gz?dl=1.

  5. Because this file is ~30GB and can get corrupted during the download, please also download the corresponding md5sum file: wget https://www.dropbox.com/s/ekfciajzevn6o1l/eQTLGenReferenceFiles.tar.gz.md5?dl=1.

  6. Check whether the download was successful: md5sum --check eQTLGenReferenceFiles.tar.gz.md5. You should see this message in your terminal: eQTLGenReferenceFiles.tar.gz: OK.

  7. Extract the reference file: tar -xzvf eQTLGenReferenceFiles.tar.gz. This yields a folder named hg38 which contains all the needed reference files. (The commands for steps 4-7 are summarised in the sketch after this list.)

  8. Go into eQTLGen_phase2/2_Imputation/eQTLGenImpute pipeline folder and adjust the imputation script template with needed modules and inputs

  9. Path to the QC'd data: the output folder of the previous Data QC pipeline, which contains the unimputed, QC'd genotype files in plink .bed/.bim/.fam format (base names ending with _ToImputation, in the subfolder output/outputfolder_gen/gen_data_QCd). This path is pre-filled.

  10. Path to reference folder (pre-filled).

  11. Output path: to the folder output you made (pre-filled).

  12. Cohort name: this should be the same as in DataQC.

  13. Path to the folder with reference files (pre-filled).

  14. The genome build of your dataset (--genome_build): hg19, GRCh37, hg38 or GRCh38 (hg19/GRCh37 by default).

  15. You can select the scheduler type by adjusting the profile as follows:

    • Slurm: -profile slurm,singularity
    • PBS/TORQUE: -profile pbs,singularity
    • SGE: -profile sge,singularity
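
For convenience, items 4-7 above can be run as the following sequence in the eQTLGen_phase2/2_Imputation folder (the -O flags are an addition to keep the ?dl=1 suffix out of the saved file names):

# Download the reference bundle and its md5 file
wget -O eQTLGenReferenceFiles.tar.gz "https://www.dropbox.com/s/6g58ygjg9d2fvbi/eQTLGenReferenceFiles.tar.gz?dl=1"
wget -O eQTLGenReferenceFiles.tar.gz.md5 "https://www.dropbox.com/s/ekfciajzevn6o1l/eQTLGenReferenceFiles.tar.gz.md5?dl=1"

# Verify the download; this should print: eQTLGenReferenceFiles.tar.gz: OK
md5sum --check eQTLGenReferenceFiles.tar.gz.md5

# Extract; this creates the hg38 folder with all needed reference files
tar -xzvf eQTLGenReferenceFiles.tar.gz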

This is the template for Slurm scheduler (submit_imputation_pipeline_template.sh):

#!/bin/bash

#SBATCH --time=72:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=6G
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --job-name="ImputeGenotypes"

# These are needed modules in UT HPC to get Singularity and Nextflow running.
# Replace with appropriate ones for your HPC.
module load java/11.0.2
module load singularity/3.5.3
module load squashfs/4.4

# If you follow the eQTLGen phase II cookbook and analysis folder structure,
# some of the following paths are pre-filled.
# https://github.com/eQTLGen/eQTLGen-phase-2-cookbook/wiki/eQTLGen-phase-II-cookbook

# We set the following variables for nextflow to prevent writing to your home directory (and potentially filling it completely)
# Feel free to change these as you wish.
export SINGULARITY_CACHEDIR=../../singularitycache
export NXF_HOME=../../nextflowcache

# Define paths and arguments
nextflow_path=../../tools # folder where Nextflow executable is.
reference_path=../hg38 # folder where you unpacked the reference files.

cohort_name=[name of your cohort]
qc_input_folder=../../1_DataQC/output/ # folder with QCd genotype and expression data, output of DataQC pipeline.
output_path=../output/ # Output path.
genome_build="GRCh37"

# Command
NXF_VER=21.10.6 ${nextflow_path}/nextflow run eQTLGenImpute.nf \
--qcdata ${qc_input_folder} \
--target_ref ${reference_path}/genome_reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--ref_panel_hg38 ${reference_path}/harmonizing_reference/30x-GRCh38_NoSamplesSorted \
--eagle_genetic_map ${reference_path}/phasing_reference/genetic_map/genetic_map_hg38_withX.txt.gz \
--eagle_phasing_reference ${reference_path}/phasing_reference/phasing/ \
--minimac_imputation_reference ${reference_path}/imputation_reference/ \
--cohort_name ${cohort_name} \
--genome_build ${genome_build} \
--outdir ${output_path}  \
-profile slurm,singularity \
-resume
  1. Submit the job.

  2. Check the output/pipeline_info/imputation_report.html.

  3. 🎉 Done!

Output

The pipeline writes its output into the specified directory. The important files are the following.

  1. The folder output/postimpute/ should be specified as the input for the next step, genotype conversion. This folder contains the imputed .vcf.gz files, filtered to MAF>0.01.
  2. The folder output/preimpute/ contains the formatted, filtered and QC'd genotype data before imputation (.vcf.gz). This is not directly needed for this cookbook; it can be used for other applications or for debugging the imputation, or simply deleted.

3. Genotype conversion

In this step, we convert the imputed genotype files from .vcf.gz format into efficient .hdf5 format. This file format is native to HASE and is needed for running the following HASE steps (encoding, partial derivative computation) in a more efficient way.

Specifically, this pipeline performs the following steps:

The per-variant quality metrics calculated in this step will be used for performing additional per-variant QC at the central site.

Input files
Instructions
  1. Go into eQTLGen_phase2/3_ConvertVcf2Hdf5.

  2. Clone the ConvertVcf2Hdf5 pipeline git clone https://github.com/eQTLGen/ConvertVcf2Hdf5.git.

  3. Make folder output

  4. Go into 3_ConvertVcf2Hdf5/ConvertVcf2Hdf5 and adjust the genotype conversion script template with needed modules and inputs.

  5. Imputed genotype path (pre-filled).

  6. Output path (pre-filled).

  7. You can select the scheduler type by adjusting the profile as follows:

    • Slurm: -profile slurm,singularity
    • PBS/TORQUE: -profile pbs,singularity
    • SGE: -profile sge,singularity

This is the template for Slurm scheduler (submit_genotype_conversion_template.sh):

#!/bin/bash

#SBATCH --time=24:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5G
#SBATCH --job-name="ConvertVcf2Hdf5"

# These are needed modules in UT HPC to get Singularity and Nextflow running.
# Replace with appropriate ones for your HPC.
module load java/11.0.2
module load singularity/3.5.3
module load squashfs/4.4

# If you follow the eQTLGen phase II cookbook and analysis folder structure,
# some of the following paths are pre-filled.
# https://github.com/eQTLGen/eQTLGen-phase-2-cookbook/wiki/eQTLGen-phase-II-cookbook

# We set the following variables for nextflow to prevent writing to your home directory (and potentially filling it completely)
# Feel free to change these as you wish.
export SINGULARITY_CACHEDIR=../../singularitycache
export NXF_HOME=../../nextflowcache

nextflow_path=../../tools/  # Path to Nextflow executable, no need to adjust if folder structure is same as recommended in cookbook.

cohortname=[name of your cohort]
genopath=../../2_Imputation/output/postimpute   # Folder with input genotype files in .vcf.gz format, no need to adjust if the folder structure is same as recommended in cookbook.
outputpath=../output/ # Path to output folder, no need to adjust if the folder structure is same as recommended in cookbook.

NXF_VER=21.10.6 ${nextflow_path}/nextflow run ConvertVcf2Hdf5.nf \
--vcf ${genopath} \
--outdir ${outputpath} \
--cohort_name ${cohortname} \
-profile slurm,singularity \
-resume
  1. Submit the job.

  2. Check the output/pipeline_info/ConvertGenotype_report.html.

  3. 🎉 Done!

Output

After successful completion of the pipeline, there should be an hdf5 genotype file structure in your output folder which, in addition to pipeline_info, contains folders named individuals, probes, genotype and SNPQC. For the eQTLGen phase 2 analyses, this folder is one of the two inputs of the per-cohort data preparations pipeline.
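
Schematically, the converted genotype folder follows the HASE hdf5 layout (a rough sketch; exact file names depend on your dataset):

output/
  pipeline_info/   # ConvertGenotype_report.html and other run reports
  individuals/     # sample identifiers (hdf5)
  probes/          # variant information (hdf5)
  genotype/        # genotype data (hdf5)
  SNPQC/           # per-variant quality metrics used for additional QC at the central site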

4. Per-cohort data preparations

This step prepares the data and performs everything that is needed for running the encoded HASE meta-analysis at the central site.

Specifically, this pipeline performs the following steps:

Input files
Instructions
  1. Go into eQTLGen_phase2/4_PerCohortPreparations.

  2. Make folder output

  3. Clone the PerCohortDataPreparations pipeline: git clone https://github.com/eQTLGen/PerCohortDataPreparations.git.

  4. Go into 4_PerCohortPreparations/PerCohortDataPreparations.

  5. Put the genotype reference files into the folder 4_PerCohortPreparations/PerCohortDataPreparations/bin/hase/data/. These were already downloaded together with the imputation references in the Imputation step, so you can copy them to the correct location like this: cp ../../2_Imputation/hg38/hase_reference/* bin/hase/data/.

  6. Adjust the template script.

  7. Path to the genotypes in hdf5 format (pre-filled)

  8. Path to QC’d expression data (pre-filled)

  9. Output path (pre-filled)

  10. You can select the scheduler type by adjusting the profile as follows:

    • Slurm: -profile slurm,singularity
    • PBS/TORQUE: -profile pbs,singularity
    • SGE: -profile sge,singularity

This is the template script for Slurm scheduler (submit_per_cohort_preparations_template.sh):

#!/bin/bash

#SBATCH --time=24:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5G
#SBATCH --job-name="RunDataPreparations"

# These are needed modules in UT HPC to get Singularity and Nextflow running.
# Replace with appropriate ones for your HPC.
module load java/11.0.2
module load singularity/3.5.3
module load squashfs/4.4

# If you follow the eQTLGen phase II cookbook and analysis folder structure,
# some of the following paths are pre-filled.
# https://github.com/eQTLGen/eQTLGen-phase-2-cookbook/wiki/eQTLGen-phase-II-cookbook

# We set the following variables for nextflow to prevent writing to your home directory (and potentially filling it completely)
# Feel free to change these as you wish.
export SINGULARITY_CACHEDIR=../../singularitycache
export NXF_HOME=../../nextflowcache

nextflow_path=../../tools

genotypes_hdf5=../../3_ConvertVcf2Hdf5/output # Folder with genotype files in .hdf5 format
qc_data_folder=../../1_DataQC/output # Folder containing QCd data, inc. expression and covariates
output_path=../output

NXF_VER=21.10.6 ${nextflow_path}/nextflow run PerCohortDataPreparations.nf \
--hdf5 ${genotypes_hdf5} \
--qcdata ${qc_data_folder} \
--outdir ${output_path} \
-profile slurm,singularity \
-resume
  1. Submit the job.

  2. Check the output/pipeline_info/PerCohortDataPreparations_report.html.

  3. 🎉 Done!

Output

The pipeline writes a folder named [YourCohortName]_IntermediateFilesEncoded_to_upload, together with an accompanying .md5 file, into the specified output directory. It contains the non-personal encoded matrices, partial derivatives and QC summaries to be shared with the central site (next step).

5. Share the data with central location

To get an SFTP account in the UT HPC, send an e-mail to urmo.vosa at gmail.com and we will arrange the account. We will also provide instructions on how to upload the needed data to the SFTP server.

The output folder of the per-cohort preparations pipeline, named [YourCohortName]_IntermediateFilesEncoded_to_upload, and its accompanying .md5 file should be uploaded to our SFTP server.
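
For illustration, a typical upload with the command-line sftp client could look like the sketch below; the server address and user name are placeholders that we will provide together with your account, and the exact name of the .md5 file may differ:

# Connect to the SFTP server (address and user name will be provided with your account)
sftp [your_username]@[sftp.server.address]

# Within the sftp session, upload the output folder and its md5 file
put -r [YourCohortName]_IntermediateFilesEncoded_to_upload
put [YourCohortName]_IntermediateFilesEncoded_to_upload.md5
exit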

❗ You might want to inform your HPC team about this project and the planned upload. Although the file size is much smaller than it would be for a classical meta-analysis, it is still ~50GB for an RNA-seq dataset of 500 samples. For the few largest datasets, the upload might be up to 500GB, which might raise some red flags when the data traffic is monitored ;).

🎉 Thank you, you have shared the data necessary for global eQTL mapping! We will keep you updated and will re-contact if any additional information is needed.

Central analyses

The following pipelines are for performing the global trans-eQTL meta-analysis at the central location and are outlined here FYI. If you are a cohort analyst, no further action is needed. However, feel free to use this code and these instructions to conduct further analyses in your own cohort.

Pipeline for running encoded meta-analysis - this is for running encoded eQTL meta-analysis on one or several datasets.

Pipeline for extracting subsets of data from full summary statistics - this is for extracting data subsets for subsequent interpretation.

We have been working on more convenient output formats and pipelines that allow quicker extraction of parts of the data. These updates are not yet included in the GitHub repositories but will be added soon.

Runtime estimates

Here are some runtime benchmarks for each of the steps in this cookbook; these might give you a hint of how long running the pipelines might take.

All benchmarks are for an RNA-seq dataset with the following features:

Data QC

Note: the benchmarks refer to the initial run with default settings; you will probably need to re-run at least once after adjusting the settings.

Imputation

Genotype conversion

Per-cohort data preparation

Acknowledgements

This cookbook utilizes HASE (https://github.com/roshchupkin/hase) and some of its helper scripts originally developed by:

Gennady V. Roshchupkin (Department of Epidemiology, Radiology and Medical Informatics, Erasmus MC, Rotterdam, Netherlands)

Hieab H. Adams (Department of Epidemiology, Erasmus MC, Rotterdam, Netherlands).

Changes from the original HASE repo

Robert Warmerdam (Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, Netherlands) modified the original HASE and fixed some bugs. He also added a classic-meta option to the HASE analyses, which enables performing (encoded) inverse-variance weighted meta-analysis, and has started implementing methods for encoded interaction analysis.

Urmo Võsa (Institute of Genomics, University of Tartu, Tartu, Estonia) incorporated it into the Nextflow pipelines and applied some minor customizations to parts of the code.

Changes:

Citation

Original method paper for HASE framework:

Roshchupkin, G. V., H. H.H. Adams, M. W. Vernooij, A. Hofman, C. M. Van Duijn, M. A. Ikram, and W. J. Niessen. “HASE: Framework for Efficient High-Dimensional Association Analyses.” Scientific Reports 6, no. 1 (December 26, 2016): 36076. https://doi.org/10.1038/srep36076.

1000G 30X imputation reference:

Byrska-Bishop, Marta, Uday S. Evani, Xuefang Zhao, Anna O. Basile, Haley J. Abel, Allison A. Regier, André Corvelo, et al. “High-Coverage Whole-Genome Sequencing of the Expanded 1000 Genomes Project Cohort Including 602 Trios.” Cell 185, no. 18 (September 1, 2022): 3426-3440.e19. https://doi.org/10.1016/j.cell.2022.08.004.

Nextflow workflow management tool:

Di Tommaso, Paolo, Maria Chatzou, Evan W Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame. “Nextflow Enables Reproducible Computational Workflows.” Nature Biotechnology 35, no. 4 (2017): 316–19. https://doi.org/10.1038/nbt.3820.

Contacts

For this cookbook: Urmo Võsa (urmo.vosa at gmail.com) and Robert Warmerdam (c.a.warmerdam at umcg.nl)

For the method of HASE, please find contacts from here: https://github.com/roshchupkin/hase