This cookbook is for running genome-wide eQTL mapping in a consortium setting, using HASE (Roshchupkin et al., 2016; original code repo). The HASE method enables genome-wide high-dimensional association analyses in consortium settings while limiting the size of the shared data and preserving participant privacy.
In this cookbook we walk through the steps required for the eQTL analysis: data QC, removal of related samples, genotype phasing and imputation, conversion of the genotype data to the required format, preparation of covariates, conversion of the expression and covariate data to the required format, data encoding, calculation of partial derivatives, preparation of files for permuted data, and organization of the non-personal encoded data and partial derivatives for sharing with the central site. We also include the pipelines that will be used for running the meta-analyses at the central site.
For eQTLGen phase II we perform this analysis for whole blood and PBMC tissues and in the individual datasets with at least 100 samples.
Classical meta-analysis requires each cohort to perform a full GWAS for every gene (~20,000 genes). The full summary statistics are large: every cohort would need to share several terabytes of data with the central site, which is technically very complicated.
In contrast, the HASE method calculates matrices of aggregate statistics (called partial derivatives) which are needed for running the predefined models and, additionally, encoded genotype and expression matrices. All these matrices are stored in an efficient file format and are relatively small, making them feasible to share. Encoding means multiplying the original data with a random matrix F (or its inverse, in the case of the expression matrix), so that no person-level information can be recovered from the genotype and expression matrices after encoding. Sharing the partial derivatives and encoded matrices with the central site still enables running pre-defined association tests between every variant and every gene for every dataset, and, finally, running the meta-analysis over datasets. An additional merit of HASE is that covariates can be dropped adaptively in the encoded association test (by slicing the aggregate partial-derivative matrices), which enables informed decisions on the most suitable set of covariates to include in the final meta-analysis. This method also shifts most of the computational burden to the central site, so individual cohorts do not need to run ~20,000 GWASes on their respective HPCs.
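Schematically, the encoding preserves exactly the quantities needed for the regression models while scrambling the individual-level rows. A minimal sketch, assuming G is the samples-by-variants genotype matrix, E the samples-by-genes expression matrix and F a random non-singular matrix (the exact orientation and transpose conventions used by HASE may differ):

$$(F\,G)^\top \left(F^{-\top} E\right) = G^\top F^\top F^{-\top} E = G^\top E$$

Because these cross-products are unchanged, the central site can still fit the pre-defined association models from the encoded matrices and the partial derivatives alone.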
You can see the specifics of the method here: Roshchupkin et al., 2016
In case of any issues when running this cookbook please contact Urmo Võsa (urmo.vosa at gmail.com) and Robert Warmerdam (c.a.warmerdam at umcg.nl).
The analysis consists of four Nextflow pipelines which are used to perform all needed tasks in the correct order. For each pipeline, we provide a Slurm script template which should be complemented/adjusted with your input paths and a few dataset-specific settings. Those pipelines are:
- DataQC: genotype and gene expression quality control.
- eQTLGenImpute: genotype phasing and imputation against the 1000G 30x WGS reference panel.
- ConvertVcf2Hdf5: conversion of the imputed genotypes into the .hdf5 format.
- PerCohortDataPreparations: data encoding, calculation of partial derivatives, and organization of the files to share with the central site.

We will provide detailed instructions for each pipeline below; however, for the most optimal setup, we recommend first organizing your analysis folder as follows:
Make an analysis root folder for this project, e.g. eQTLGen_phase2.
mkdir eQTLGen_phase2
Inside this folder, make another folder for the Nextflow executable, called tools. This path is specified in all the template scripts so that the scripts can find the Nextflow executable.
mkdir eQTLGen_phase2/tools
Inside eQTLGen_phase2/tools, download and self-install the Nextflow executable, as specified here. You might need to load Java >=11 before running the self-install (e.g. module load [Java >=11 module name in your HPC]).
cd eQTLGen_phase2/tools
wget -qO- https://get.nextflow.io | bash
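To check that the self-install succeeded, you can ask the executable to print its version (a quick sanity check; requires Java to be loaded):

```bash
# Should print the Nextflow version and build information
./nextflow -version
```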
❗ If this does not work, it might be that your HPC restricts internet access. If this is the case, please use these instructions to try to install Nextflow offline. Please contact the lead analysts in case additional help is needed.
Inside eQTLGen_phase2, make separate folders for each step of this analysis plan:
eQTLGen_phase2/1_DataQC
eQTLGen_phase2/2_Imputation
eQTLGen_phase2/3_ConvertVcf2Hdf5
eQTLGen_phase2/4_PerCohortPreparations
cd ../
mkdir 1_DataQC 2_Imputation 3_ConvertVcf2Hdf5 4_PerCohortPreparations
Inside each of those folders, clone/download the corresponding pipeline. If git is available on your HPC, the easiest way is the command git clone [repo of the pipeline]. This yields a pipeline folder; e.g. for the first pipeline it will be eQTLGen_phase2/1_DataQC/DataQC.
cd 1_DataQC
git clone https://github.com/eQTLGen/DataQC.git
cd ../2_Imputation
git clone https://github.com/eQTLGen/eQTLGenImpute.git
cd ../3_ConvertVcf2Hdf5
git clone https://github.com/eQTLGen/ConvertVcf2Hdf5.git
cd ../4_PerCohortPreparations
git clone https://github.com/eQTLGen/PerCohortDataPreparations.git
We recommend specifying an output (and, if needed, input etc.) folder for each step, e.g. eQTLGen_phase2/1_DataQC/output.
cd ..
mkdir 1_DataQC/output 2_Imputation/output 3_ConvertVcf2Hdf5/output 4_PerCohortPreparations/output
Inside each pipeline folder there is a script template using the name format submit_*_template.sh, e.g. eQTLGen_phase2/1_DataQC/DataQC/submit_DataQc_pipeline_template.sh. You should adjust it according to your data (e.g. specify the path to your input folder, add the required HPC modules) and save it. For better tracking, we recommend also renaming it, e.g. eQTLGen_phase2/1_DataQC/DataQC/submit_DataQc_pipeline_EstBB_HT12v3.sh.
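For example (the dataset-specific file name below is just the example used throughout this cookbook):

```bash
cd eQTLGen_phase2/1_DataQC/DataQC
# Keep the original template and edit a dataset-specific copy
cp submit_DataQc_pipeline_template.sh submit_DataQc_pipeline_EstBB_HT12v3.sh
```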
❗ To ease the process, we have pre-filled some paths in the templates, assuming that you use the recommended folder structure.
For example, the QCd genotype files in eQTLGen_phase2/1_DataQC/output/[Your Cohort Name]/outputfolder_gen/gen_data_QCd/ are the input for the second pipeline (imputation).

On some HPCs internet access is restricted, which causes error messages when installing and running Nextflow. This is because, by default, the Nextflow executable tries to contact the internet to check whether a newer version of Nextflow is available. Also, by default, the first Data QC pipeline tries to automatically download plink executables and the 1000G reference panel. To get the setup working offline, please follow these instructions. Please consult the lead analysts in case the abovementioned instructions do not solve the issues.
Example commands here are based on the Slurm scheduler. However, this can be easily changed to other schedulers.
In order to run the Nextflow pipelines, you have to modify each template script and load a couple of modules which enable running Nextflow and Singularity containers. These very common modules should be available on your HPC; however, the exact names and versions might differ: Java >=11 and Singularity. You might also need a few extra modules to run Singularity without problems. If you think this might be the case, we recommend checking with your HPC documentation and/or tech support which modules are required. E.g. in the University of Tartu HPC, a module named squashfs is also needed to run Singularity without problems.
When you submit the job, e.g. sbatch submit_DataQc_pipeline_EstBB_HT12v3.sh, this initiates the pipeline, builds the analysis environment (using a Singularity container) and automatically submits all the pipeline steps in the correct order and in a parallelized way. A separate work directory is made in the pipeline folder and will contain all the interim files of the pipeline.
Monitoring:
Monitor the slurm-***.out log file or the .nextflow.log file and check that all the steps finish without errors. Trick: the command watch tail -n 20 slurm-***.out helps you to interactively monitor the status of the jobs.
Use squeue -u [YourUserName] to see if individual tasks are in the queue.
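For convenience, the monitoring commands can be run from the pipeline folder, e.g.:

```bash
# Interactively follow the scheduler log (Ctrl+C to stop)
watch tail -n 20 slurm-*.out
# Or follow the Nextflow log directly
tail -f .nextflow.log
# List your queued/running tasks
squeue -u $USER
```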
If the pipeline has crashed (e.g. due to wall time), you can just resubmit the same script after any required changes are implemented. Nextflow does not rerun completed steps and continues only from the steps which had not been completed.
When the work has finished, download and check the job report for potential errors or warnings. This file is automatically written to the pipeline_info subfolder of your output folder, e.g. output/pipeline_info/DataQcReport.html.
When you need to do some debugging, you can use the last section of the aforementioned report to figure out in which subfolder of the work folder the actual step was run. You can then navigate to this folder and investigate the following hidden files:
.command.sh: the script which was submitted
.command.log: log file for seeing the analysis outputs/errors
.command.err: file which lists the errors, if any
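For example, from the pipeline folder (the work subdirectory below is hypothetical; take the real one from the report):

```bash
cd work/4a/3f2b1c   # hypothetical hash; copy the actual path from the report
cat .command.sh     # the script which was submitted
less .command.log   # analysis outputs and errors
less .command.err   # errors only, if any
```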
Hopefully these three files give you some handle on what might have gone wrong. Contact us if you think you need any help to resolve the issue.
We have specified reasonable computational resources for each pipeline step. However, in some cases these might not suffice, depending on the HPC or dataset size. Resources for each pipeline step are specified in config/base.conf. This file lists the currently specified resources (RAM, CPU, walltime) for each step of the pipeline. You can try to increase those if the pipeline crashes due to insufficient resources. You can also adjust cluster-specific options by modifying the clusterOptions flag for each process. For example, you can modify clusterOptions = '--job-name=GenotypeQC' into clusterOptions = '--job-name=GenotypeQC -p long' to ensure that this specific process runs on the partition named long.
For some analyses, the work directory can become quite large. So, when the pipeline has successfully finished and you have checked the output, you can delete the work directory in order to save storage space on your HPC. But this means that the pipeline will restart from scratch if you ever need to rerun it.
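For example, from the pipeline folder:

```bash
# Check how much space the interim files take, then remove them
# only after you have verified the pipeline output
du -sh work/
rm -rf work/
```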
Nextflow and Singularity by default create cache folders in your home directory, which may unexpectedly fill your home. Therefore, we manually set and export the environment variables responsible for the cache folders within each Nextflow pipeline to ../../singularitycache and ../../nextflowcache. If you followed the naming scheme from the setup chapter and ran the pipelines from within the pipeline directories, these evaluate to eQTLGen_phase2/singularitycache and eQTLGen_phase2/nextflowcache. Be aware that these might differ from your defaults if you have sourced the submit script from a different directory.
For this analysis plan you need the following information and data:
An informative name for your dataset is one of the required inputs for the majority of the pipelines in this cookbook. This could be the combination of the cohort name and expression platform. If you contribute multiple datasets from e.g. several batches, you could specify it by adding _batch1. If you contribute multiple datasets from several ancestries, you could specify this by adding the ancestry, e.g. _EUR.
Example 1: For the EstBB cohort I have 2 datasets: EstBB_HT12v3 and EstBB_RNAseq.

Example 2: If there were two EstBB RNAseq datasets, I would use EstBB_RNAseq_batch1 and EstBB_RNAseq_batch2.
Unimputed genotype data in PLINK binary format (.bed/.bim/.fam). The .fam file should contain information about the reported sex of the samples. This information is used in the automatic data quality control step. If this information is not available, you can still use those files, but this specific quality control step will be skipped.

❗ Standard pre-imputation genotype and sample quality control will be done by our pipeline, so there is no need for thorough QC beforehand.
A gene expression matrix whose feature identifiers are either ENSEMBL gene IDs (RNA-seq), Illumina ArrayAddress IDs for probes, or probe names for the corresponding Affymetrix array.

❗ For RNA-seq data we expect that the raw RNA-seq data has already been appropriately processed and QC'd prior to generating the count matrix (e.g. proper settings for the read mapping, etc.).
❗ For Affymetrix arrays we expect that all the standard data preprocessing steps for this array type have already been applied (e.g. RMA normalization). On top of those, our pipeline just applies an inverse normal transformation to each gene (forcing the gene expression distribution to be normal).
❗ For Affymetrix arrays we have currently used probe-to-gene linking as used in eQTLGen phase 1. It is possible that you have some other format for probe IDs: if this is the case, please contact us and we will update the pipeline accordingly.
If you participated in eQTLGen phase 1 analyses, this file should already be available.
As standard, within the DataQC step a number of default covariates are automatically constructed from both the genotype data and the expression data. Respectively, we calculate 10 genotype PCs and 100 expression PCs from the data. The genotype-inferred sex is used as an additional covariate. Optionally, an additional set of covariates can be supplied if you know that these are relevant for the given dataset. For that, please make a file with additional cohort-specific covariates:
The first column of this file should be named SampleID.

We assume that potential sample mixups have already been identified, resolved or removed in previous research; e.g. see details from here.
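As an illustration, such an additional covariate file could look like the sketch below (the covariate names are just examples and a tab-delimited layout is assumed here; only the SampleID column name is required):

```
SampleID    Covariate1    Covariate2
Sample1     0.52          1
Sample2     -1.20         0
```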
In this step, we prepare the genotype and gene expression data for the next steps: the pipeline performs standard pre-imputation genotype and sample QC, expression QC and normalization, checks for sample mixups/contamination, constructs the default covariates (10 genotype PCs, 100 expression PCs and genotype-inferred sex), and prepares the QCd files used by the following pipelines.
❗ Because pre-phasing of genotypes benefits from larger sample sizes and many eQTL datasets have modest sample sizes (N<1,000), it is not advisable to prefilter the unimputed genotype data to include only the eQTL samples. Our pipeline will extract the eQTL samples itself (based on the genotype-to-expression file) and, if additional samples are available, includes up to 5,000 additional genotype samples which are kept until the pre-phasing step of the analysis. If genotype data is available for <5,000 samples, the pipeline uses only those samples which are available.
❗ For Affymetrix arrays: if you did not participate in eQTLGen phase 1, it might be that your probe name format does not align with our pipeline. Please contact urmo.vosa at gmail.com with some information on how the probe naming and annotation was done in your data. We will then update the pipeline accordingly.
The required inputs are the unimputed genotype data (--bfile), the gene expression matrix (--expfile), the genotype-to-expression linking file (--gte) and, optionally, a file with additional cohort-specific covariates (--AdditionalCovariates); the first column of the covariate file should be named SampleID.

Go into eQTLGen_phase2/1_DataQC.

Clone the DataQC pipeline:
git clone https://github.com/eQTLGen/DataQC.git

Make folder output.

Go into eQTLGen_phase2/1_DataQC/DataQC and adjust the script template with the needed modules, input paths and settings.

Aside from the paths to your data files, you must specify:

Expression platform (--exp_platform). Options: Illumina arrays: HT12v3, HT12v4 or HuRef8; RNA-seq: RNAseq; Affymetrix arrays: AffyU219 or AffyHumanExon.

Cohort name (--cohort_name): your unique dataset name used by all pipelines.

The genome build of your dataset (--genome_build): hg19, GRCh37, hg38 or GRCh38 (GRCh37 prefilled).

If your genotype data is in VCF format, you can replace the --bfile argument with the --vcf argument. The --vcf argument expects a full path to a vcf dataset. Pathname extension using globbing is allowed (using * or ?), but the path should be provided without pathname extension.

You can select the scheduler type by adjusting the profile as following:
-profile slurm,singularity
-profile pbs,singularity
-profile sge,singularity

If your HPC has no internet access, you can point the pipeline to locally downloaded dependencies (--plink_executable, --plink2_executable and --reference_1000g_folder). Contact the lead analysts for further guidance, if needed.

This is the template for the Slurm scheduler (submit_DataQC_pipeline_template.sh):
#!/bin/bash
#SBATCH --time=48:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=6G
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --job-name="DataQc"
# These are needed modules in UT HPC to get singularity and Nextflow running. Replace with appropriate ones for your HPC.
module load java/11.0.2
module load singularity/3.5.3
module load squashfs/4.4
# If you follow the eQTLGen phase II cookbook and analysis folder structure,
# some of the following paths are pre-filled.
# https://github.com/eQTLGen/eQTLGen-phase-2-cookbook/wiki/eQTLGen-phase-II-cookbook
# We set the following variables for nextflow to prevent writing to your home directory (and potentially filling it completely)
# Feel free to change these as you wish.
export SINGULARITY_CACHEDIR=../../singularitycache
export NXF_HOME=../../nextflowcache
# Disable pathname expansion. Nextflow handles pathname expansion by itself.
set -f
# Define paths
nextflow_path=../../tools # folder where Nextflow executable is
# Genotype data
bfile_path=[full path to your input genotype files without .bed/.bim/.fam extension]
# Other data
exp_path=[full path to your gene expression matrix]
gte_path=[full path to your genotype-to-expression file]
exp_platform=[expression platform name: HT12v3/HT12v4/HuRef8/RNAseq/AffyU219/AffyHumanExon]
cohort_name=[name of the cohort]
genome_build="GRCh37"
output_path=../output # Output path
# Additional settings and optional arguments for the command
# --GenOutThresh [numeric threshold]
# --GenSdThresh [numeric threshold]
# --ExpSdThresh [numeric threshold]
# --ContaminationSlope [number between 0 and 90, default 45]
# --ContaminationArea [number between 0 and 90, default 30]
# --AdditionalCovariates [file with additional covariates]
# --InclusionList [file with the list of samples to restrict the analysis]
# --ExclusionList [file with the list of samples to remove from the analysis]
# --preselected_sex_check_vars "data/Affy6_pruned_chrX_variant_positions.txt"
# --AdditionalCovariates [file with additional covariates. First column should be `SampleID`]
# --gen_qc_steps 'WGS'
# --fam [PLINK .fam file. Takes precedence over .fam file supplied with `--bfile`]
# --plink_executable [path to plink executable (PLINK v1.90b6.26 64-bit)]
# --plink2_executable [path to plink2 executable (PLINK v2.00a3.7LM 64-bit Intel)]
# --reference_1000g_folder [path to folder with 1000G reference data]
# Command:
NXF_VER=21.10.6 ${nextflow_path}/nextflow run DataQC.nf \
--bfile ${bfile_path} \
--expfile ${exp_path} \
--gte ${gte_path} \
--exp_platform ${exp_platform} \
--cohort_name ${cohort_name} \
--genome_build ${genome_build} \
--outdir ${output_path} \
-profile slurm,singularity \
-resume
Submit the job.
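For example, using the dataset-specific script name from the setup section above:

```bash
sbatch submit_DataQc_pipeline_EstBB_HT12v3.sh
```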
Go into the output folder, then download and investigate the thorough report file Report_DataQc_[your cohort name].html. You will find diagnostic plots and instructions on how to proceed in this report.
According to the instructions in the QC report, it is likely that you need to adjust some of the QC thresholds. In your Slurm script, you can add some or all of the arguments --GenOutThresh, --GenSdThresh, --ExpSdThresh, --ContaminationSlope and --ContaminationArea, and specify the appropriate value for each of those.
In WGS datasets, many samples might not fit the thresholds set on the X-chromosome heterozygosity (F<0.2 and F>0.8). To resolve this, you can add the argument --preselected_sex_check_vars ${sex_check_vars_path}, which will make the method use a preselected set of variants selected based on array data.
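For example, reusing the example value from the commented-out option in the submit script template (the path is assumed to be relative to the DataQC pipeline folder):

```bash
# Preselected chrX variants shipped with the DataQC pipeline (see the template comments)
sex_check_vars_path="data/Affy6_pruned_chrX_variant_positions.txt"
# Then append to the nextflow command in the submit script:
#   --preselected_sex_check_vars ${sex_check_vars_path} \
```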
In some cases, you might also need to split the data into several batches, according to the ancestry. Then you should run this and following pipelines for each of the batches separately.
Resubmit the job.
Go into the output folder, download and investigate the report file
Report_DataQc_[your cohort name].html
. Now QC plots should look as
expected.
If some of the plots still indicate issues, adjust the arguments and re-run the pipeline. You might need to do so a couple of times until there are no more apparent issues.
🎉 Done!
The pipeline writes its output into the specified directory. The important files are the following.
output/outputfolder_gen/gen_data_QCd/*_ToImputation.bed, output/outputfolder_gen/gen_data_QCd/*_ToImputation.bim and output/outputfolder_gen/gen_data_QCd/*_ToImputation.fam are the filtered and QCd genotype files which are the input for the next, imputation pipeline. You need to specify the path to the base name (no .bed/.bim/.fam extension) as an input.

The output folder should be specified as one of the inputs for the per-cohort preparations pipeline. This pipeline automatically uses the processed, QCd expression data and covariate file to run data encoding and partial-derivative calculation. It then organises the encoded matrices for sharing with the central site. It also extracts some QC files, plots and summaries for sharing with the central site.

❗ While continuing with the next step, now is also a good time to send the Report_DataQc_[your cohort name].html to Urmo Võsa (urmo.vosa at gmail.com) and Robert Warmerdam (c.a.warmerdam at umcg.nl). This way we can have an initial look at the quality of the data and let you know if something seems suboptimal.
In this step we impute the QC’d genotype data to the recently released 1000G 30X WGS reference panel.
Specifically, this pipeline performs the following steps:
- Harmonizes the QCd genotype data to the hg38 reference genome.
- Phases the genotypes (Eagle) and imputes them (Minimac) against the 1000G 30x WGS reference panel.
- Filters the imputed variants (MAF>0.01) and writes the output in .vcf format.

Required input: unimputed, QCd genotype data in plink .bed/.bim/.fam format. These files are in the output of the previous data QC pipeline, in the folder output/outputfolder_gen/gen_data_QCd. You need to specify the path to the base name (no .bed/.bim/.fam extension) as an input.

Go into eQTLGen_phase2/2_Imputation.
Clone the eQTLGenImpute pipeline:
git clone https://github.com/eQTLGen/eQTLGenImpute.git
Make folder output.
Download the zipped reference:
wget https://www.dropbox.com/s/6g58ygjg9d2fvbi/eQTLGenReferenceFiles.tar.gz?dl=1
Because this file is ~30GB and can get corrupted during the download, please also download the corresponding md5sum file:
wget https://www.dropbox.com/s/ekfciajzevn6o1l/eQTLGenReferenceFiles.tar.gz.md5?dl=1
Check that the download was successful:
md5sum --check eQTLGenReferenceFiles.tar.gz.md5
You should see this message in your terminal: eQTLGenReferenceFiles.tar.gz: OK
Unpack the reference file:
tar -xzvf eQTLGenReferenceFiles.tar.gz
This yields a folder named hg38 which contains all the needed reference files.
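You can quickly check that the expected reference subfolders are present (the folder names below are taken from the paths used in the templates of this cookbook):

```bash
ls hg38/
# Expected, among others: genome_reference, harmonizing_reference,
# phasing_reference, imputation_reference, hase_reference
```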
Go into the eQTLGen_phase2/2_Imputation/eQTLGenImpute pipeline folder and adjust the imputation script template with the needed modules and inputs:

Path to the QCd genotype data: path pointing to the unimputed and QC'd genotype files in plink .bed/.bim/.fam format, ending with *_ToImputation. These files are in the output of the previous data QC pipeline (ending with *_ToImputation.*) in the folder output/outputfolder_gen/gen_data_QCd. This path is pre-filled.
Path to reference folder (pre-filled).
Output path: the output folder you made (pre-filled).

Cohort name: this should be the same as in DataQC.
Path to the folder with reference files (pre-filled).
The genome build of your dataset (--genome_build): hg19, GRCh37, hg38 or GRCh38 (hg19/GRCh37 by default).
You can select scheduler type by adjusting the profile as following:
-profile slurm,singularity
-profile pbs,singularity
-profile sge,singularity
This is the template for the Slurm scheduler (submit_imputation_pipeline_template.sh):
#!/bin/bash
#SBATCH --time=72:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=6G
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --job-name="ImputeGenotypes"
# These are needed modules in UT HPC to get Singularity and Nextflow running.
# Replace with appropriate ones for your HPC.
module load java/11.0.2
module load singularity/3.5.3
module load squashfs/4.4
# If you follow the eQTLGen phase II cookbook and analysis folder structure,
# some of the following paths are pre-filled.
# https://github.com/eQTLGen/eQTLGen-phase-2-cookbook/wiki/eQTLGen-phase-II-cookbook
# We set the following variables for nextflow to prevent writing to your home directory (and potentially filling it completely)
# Feel free to change these as you wish.
export SINGULARITY_CACHEDIR=../../singularitycache
export NXF_HOME=../../nextflowcache
# Define paths and arguments
nextflow_path=../../tools # folder where Nextflow executable is.
reference_path=../hg38 # folder where you unpacked the reference files.
cohort_name=[name of your cohort]
qc_input_folder=../../1_DataQC/output/ # folder with QCd genotype and expression data, output of DataQC pipeline.
output_path=../output/ # Output path.
genome_build="GRCh37"
# Command
NXF_VER=21.10.6 ${nextflow_path}/nextflow run eQTLGenImpute.nf \
--qcdata ${qc_input_folder} \
--target_ref ${reference_path}/genome_reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--ref_panel_hg38 ${reference_path}/harmonizing_reference/30x-GRCh38_NoSamplesSorted \
--eagle_genetic_map ${reference_path}/phasing_reference/genetic_map/genetic_map_hg38_withX.txt.gz \
--eagle_phasing_reference ${reference_path}/phasing_reference/phasing/ \
--minimac_imputation_reference ${reference_path}/imputation_reference/ \
--cohort_name ${cohort_name} \
--genome_build ${genome_build} \
--outdir ${output_path} \
-profile slurm,singularity \
-resume
Submit the job.
Check the output/pipeline_info/imputation_report.html.
🎉 Done!
The pipeline writes its output into the specified directory. The important files are the following.
output/postimpute/ should be specified as the input for the next step, genotype conversion. This folder contains the imputed .vcf.gz files, filtered by MAF>0.01.

output/preimpute/ contains the formatted, filtered, and QCd genotype data before imputation (.vcf.gz). This is not directly needed for this cookbook: it might be used for other applications or for debugging the imputation, or it can just be deleted.

In this step, we convert the imputed genotype files from .vcf.gz format into the efficient .hdf5 format. This file format is native to HASE and is needed for running the following HASE steps (encoding, partial-derivative computation) in a more efficient way.
Specifically, this pipeline performs the following steps:
- Splits the imputed .vcf.gz files into chunks of 25,000 variants for faster parallel processing.
- Converts the genotype chunks into .hdf5 format.
- Extracts per-variant quality metrics; these will be used for performing additional per-variant QC at the central site.

Required input: imputed genotype data in .vcf.gz format (the output of the imputation pipeline).

Go into eQTLGen_phase2/3_ConvertVcf2Hdf5.
Clone the ConvertVcf2Hdf5 pipeline:
git clone https://github.com/eQTLGen/ConvertVcf2Hdf5.git
Make folder output
Go into 3_ConvertVcf2Hdf5/ConvertVcf2Hdf5 and adjust the genotype conversion script template with the needed modules and inputs.
Imputed genotype path (pre-filled).
Output path (pre-filled).
You can select scheduler type by adjusting the profile as following:
-profile slurm,singularity
-profile pbs,singularity
-profile sge,singularity
This is the template for the Slurm scheduler (submit_genotype_conversion_template.sh):
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5G
#SBATCH --job-name="ConvertVcf2Hdf5"
# These are needed modules in UT HPC to get Singularity and Nextflow running.
# Replace with appropriate ones for your HPC.
module load java/11.0.2
module load singularity/3.5.3
module load squashfs/4.4
# If you follow the eQTLGen phase II cookbook and analysis folder structure,
# some of the following paths are pre-filled.
# https://github.com/eQTLGen/eQTLGen-phase-2-cookbook/wiki/eQTLGen-phase-II-cookbook
# We set the following variables for nextflow to prevent writing to your home directory (and potentially filling it completely)
# Feel free to change these as you wish.
export SINGULARITY_CACHEDIR=../../singularitycache
export NXF_HOME=../../nextflowcache
nextflow_path=../../tools/ # Path to Nextflow executable, no need to adjust if folder structure is same as recommended in cookbook.
cohortname=[name of your cohort]
genopath=../../2_Imputation/output/postimpute # Folder with input genotype files in .vcf.gz format, no need to adjust if the folder structure is same as recommended in cookbook.
outputpath=../output/ # Path to output folder, no need to adjust if the folder structure is same as recommended in cookbook.
NXF_VER=21.10.6 ${nextflow_path}/nextflow run ConvertVcf2Hdf5.nf \
--vcf ${genopath} \
--outdir ${outputpath} \
--cohort_name ${cohortname} \
-profile slurm,singularity \
-resume
Submit the job.
Check the output/pipeline_info/ConvertGenotype_report.html.
🎉 Done!
After successful completion of the pipeline, there should be an hdf5 genotype file structure in your output folder which, in addition to pipeline_info, contains folders named individuals, probes, genotype and SNPQC. For eQTLGen phase 2 analyses, this folder is one of the two inputs of the per-cohort data preparations pipeline.
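A quick way to check this structure from the ConvertVcf2Hdf5 pipeline folder (assuming the recommended ../output path):

```bash
ls ../output/
# Expected subfolders: genotype  individuals  pipeline_info  probes  SNPQC
```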
This step prepares the data and performs all the steps which are needed for running encoded HASE meta-analysis in the central site.
Specifically, this pipeline performs the following steps:
- Encodes the genotype and expression matrices.
- Calculates the partial-derivative matrices.
- Prepares the files for permuted data.
- Organises the non-personal encoded data, logs and reports for sharing with the central site.
- Calculates an md5sum for all the shared files, so the integrity of the upload can be checked.

Required inputs: the output folder of the data QC pipeline (QCd expression data and covariates) and the genotype data in .hdf5 format; the latter folder is the output of the genotype conversion pipeline.

Go into eQTLGen_phase2/4_PerCohortPreparations.
Make folder output
Clone the PerCohortDataPreparations pipeline:
git clone https://github.com/eQTLGen/PerCohortDataPreparations.git
Go into 4_PerCohortPreparations/PerCohortDataPreparations.
Put the genotype reference file into the 4_PerCohortPreparations/PerCohortDataPreparations/bin/hase/data/ folder. This was already downloaded together with the imputation references in the Imputation step, so you can copy it to the correct location like this:
cp ../../2_Imputation/hg38/hase_reference/* bin/hase/data/.
Adjust the template script.
Path to the genotypes in hdf5 format (pre-filled)
Path to QC’d expression data (pre-filled)
Output path (pre-filled)
You can select scheduler type by adjusting the profile as following:
-profile slurm,singularity
-profile pbs,singularity
-profile sge,singularity
This is the template script for the Slurm scheduler (submit_per_cohort_preparations_template.sh):
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5G
#SBATCH --job-name="RunDataPreparations"
# These are needed modules in UT HPC to get Singularity and Nextflow running.
# Replace with appropriate ones for your HPC.
module load java/11.0.2
module load singularity/3.5.3
module load squashfs/4.4
# If you follow the eQTLGen phase II cookbook and analysis folder structure,
# some of the following paths are pre-filled.
# https://github.com/eQTLGen/eQTLGen-phase-2-cookbook/wiki/eQTLGen-phase-II-cookbook
# We set the following variables for nextflow to prevent writing to your home directory (and potentially filling it completely)
# Feel free to change these as you wish.
export SINGULARITY_CACHEDIR=../../singularitycache
export NXF_HOME=../../nextflowcache
nextflow_path=../../tools
genotypes_hdf5=../../3_ConvertVcf2Hdf5/output # Folder with genotype files in .hdf5 format
qc_data_folder=../../1_DataQC/output # Folder containing QCd data, inc. expression and covariates
output_path=../output
NXF_VER=21.10.6 ${nextflow_path}/nextflow run PerCohortDataPreparations.nf \
--hdf5 ${genotypes_hdf5} \
--qcdata ${qc_data_folder} \
--outdir ${output_path} \
-profile slurm,singularity \
-resume
Submit the job.
Check the output/pipeline_info/PerCohortDataPreparations_report.html.
🎉 Done!
In the output folder there is a subfolder called [YourCohortName]_IntermediateFilesEncoded_to_upload. This folder contains all the non-personal files, logs and reports, and it should be shared with the central site.

In the output folder there is also the file [YourCohortName]_IntermediateFilesEncoded_to_upload.md5. This should also be shared with the central site, in order to check the integrity of the uploaded files.

Instructions for getting an SFTP account in the UT HPC: send an e-mail to urmo.vosa at gmail.com and we will arrange the account. We will also instruct you on how to upload the needed data to the SFTP server.
The output folder from the per-cohort preparations pipeline, named [YourCohortName]_IntermediateFilesEncoded_to_upload, and its accompanying .md5 file should be uploaded to our SFTP server.
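If you want, you can verify the files locally against the checksum file before (or after) uploading; a sketch, assuming the .md5 file is in standard md5sum format and sits next to the folder it describes:

```bash
cd ../output   # folder containing the *_IntermediateFilesEncoded_to_upload folder and its .md5 file
md5sum --check [YourCohortName]_IntermediateFilesEncoded_to_upload.md5
```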
❗ You might want to inform your HPC team about this project and the planned upload. Although the sheer file size is much smaller than it would be for a classical meta-analysis, it is still ~50GB in the case of 500 samples and RNA-seq. For the few largest datasets, the upload might be up to 500GB and might raise some red flags when monitoring the data traffic ;).
🎉 Thank you, you have shared the data necessary for global eQTL mapping! We will keep you updated and will re-contact if any additional information is needed.
The following pipelines are for performing the global trans-eQTL meta-analysis at the central location and are outlined here FYI. If you are a cohort analyst, no further action is needed. However, feel free to use this code and these instructions to conduct further analyses in your respective cohort.
Pipeline for running encoded meta-analysis - this is for running encoded eQTL meta-analysis on one or several datasets.
Pipeline for extracting subsets of data from full summary statistics - this is for extracting data subsets for subsequent interpretation.
We have been working on more convenient output formats and pipelines that allow for quicker extractions of parts of the data. These updates are not yet included in the github repositories. These will be updated soon.
Here are some runtime benchmarks for each of the steps in this cookbook; these might give you a hint of how much time running the pipelines might take.

All benchmarks are for an RNA-seq dataset with the following features:
- Samples in the .bed genotype file: 1,052
- Variants in the .bed genotype file: 700,078

Data QC:
- Size of the output directory: 344MB
- Size of the work subdirectory: 1.8GB
- Note: the benchmark refers to the initial run with default settings; you probably need to re-run at least once after adjusting the settings.

Imputation:
- Size of the output directory: ~13GB
- Size of the work subdirectory: ~100GB

Genotype conversion:
- .vcf.gz genotype data: ~4,000
- Size of the output directory: ~4GB
- Size of the work subdirectory: ~100GB

Per-cohort data preparations:
- Size of the output directory: ~50GB
- Size of the work subdirectory: ~100GB

This cookbook utilizes HASE (https://github.com/roshchupkin/hase) and some of its helper scripts, originally developed by:
Gennady V. Roshchupkin (Department of Epidemiology, Radiology and Medical Informatics, Erasmus MC, Rotterdam, Netherlands)
Hieab H. Adams (Department of Epidemiology, Erasmus MC, Rotterdam, Netherlands).
Robert Warmerdam (Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, Netherlands) modified the original HASE and fixed some bugs. He also added the classic-meta option to HASE analyses, which enables performing (encoded) inverse-variance weighted meta-analysis, and has started implementing methods for encoded interaction analysis.
Urmo Võsa (Institute of Genomics, University of Tartu, Tartu, Estonia) incorporated it into the Nextflow pipelines and applied some minor customizations to parts of the code.
Changes:
- Fixed the --intercept option having no effect.
- Changed self.chunk_size=10000 --> self.chunk_size=20000.

Original method paper for the HASE framework: Roshchupkin et al. HASE: Framework for efficient high-dimensional association analyses. Scientific Reports (2016).

1000G 30x imputation reference: Byrska-Bishop et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell (2022).

Nextflow workflow management tool: Di Tommaso et al. Nextflow enables reproducible computational workflows. Nature Biotechnology (2017).
For this cookbook: Urmo Võsa (urmo.vosa at gmail.com) and Robert Warmerdam (c.a.warmerdam at umcg.nl)
For the method of HASE, please find contacts from here: https://github.com/roshchupkin/hase