In some HPC setups, internet access is restricted. This document helps you get the eQTLGen phase II cookbook running when there is no internet access. Please first have a look at the analysis setup, make whatever preparations are possible, and then continue with the instructions below.
When you install Nextflow as specified here, it automatically tries to connect to the internet and check whether an update is available. In case of firewall or connection restrictions, you might see an error message. To run Nextflow offline, follow these steps:
cd eQTLGen_phase2/tools
wget https://github.com/nextflow-io/nextflow/releases/download/v21.10.6/nextflow-21.10.6-all
mv nextflow-21.10.6-all nextflow
chmod +x nextflow
echo "export NXF_OFFLINE='TRUE'" >> ~/.bashrc
module load [Java >=11 module name in your HPC]
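Note that the export appended to ~/.bashrc only takes effect in new shells. To enable offline mode in your current session as well, source the file (or export the variable directly):

source ~/.bashrc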
The test run might take some memory, so first initiate an interactive job (the example below is for SLURM):
srun --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
Alternatively, perform the test run as a regular scheduled job (a minimal sketch follows the test command below).
Run the command:
./nextflow run hello
You should see the output of the test run, including the Nextflow version number 21.10.6.
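If you choose the scheduled-job route instead, a minimal SLURM batch script for the test could look like this (a sketch; replace the module name with the Java module available on your HPC and adjust the path to the tools folder if needed):

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=4G
#SBATCH --job-name="NextflowHelloTest"

module load [Java >=11 module name in your HPC]

cd eQTLGen_phase2/tools
./nextflow run hello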
Our Nextflow pipelines use Singularity for software dependency management. By default, the pipelines try to download the corresponding container images from our online repository. However, if internet access is restricted, you will see an error. Therefore, we provide the container images here via a separate Dropbox link, so you can download them in whichever way is possible and avoid the automatic download by the pipeline.
cd eQTLGen_phase2
wget "https://www.dropbox.com/scl/fi/sikrvqoftp74iuq951i41/eQTLGenP2OfflineFiles.tar.gz?rlkey=lgnqv1v84v9btks7jy42u7von&dl=1" -O eQTLGenP2OfflineFiles.tar.gz
wget "https://www.dropbox.com/scl/fi/0fdidje513qxsza4kynar/eQTLGenP2OfflineFiles.tar.gz.md5?rlkey=6679cowtq9q2crjlmt9enz7xk&dl=1" -O eQTLGenP2OfflineFiles.tar.gz.md5
md5sum -c eQTLGenP2OfflineFiles.tar.gz.md5
tar -xvzf eQTLGenP2OfflineFiles.tar.gz
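After unpacking, you can quickly check that the expected per-pipeline folders are present; these are the folders referenced by the copy/move commands below:

ls eQTLGenP2OfflineFiles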
In the unpacked folder there are subfolders with container images for each pipeline. After you have cloned each pipeline from GitHub (as specified here), you should copy/move each singularity_img folder into the corresponding pipeline folder.
cp -R eQTLGenP2OfflineFiles/1_DataQC_additional_files/singularity_img 1_DataQC/DataQC/.
mv eQTLGenP2OfflineFiles/2_Imputation_additional_files/singularity_img 2_Imputation/eQTLGenImpute/.
mv eQTLGenP2OfflineFiles/1_DataQC_additional_files/singularity_img/* 2_Imputation/eQTLGenImpute/singularity_img/.
mv eQTLGenP2OfflineFiles/3_ConvertVcf2Hdf5_additional_files/singularity_img 3_ConvertVcf2Hdf5/ConvertVcf2Hdf5/.
mv eQTLGenP2OfflineFiles/4_PerCohortPreparations/singularity_img 4_PerCohortPreparations/PerCohortDataPreparations/.
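To confirm that the images ended up in the right place, you can list the singularity_img folders inside the pipeline directories (assuming you run this from the eQTLGen_phase2 folder and follow the cookbook folder structure):

ls -d */*/singularity_img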
Now the Nextflow pipelines will use these images in the analysis instead of trying to download them from the online repository.
❗ If the genotype build of your input is not hg19 but hg38, you additionally need a couple of chain files for liftOver:
- Make a chain_folder subfolder inside your offline files folder:
mkdir eQTLGenP2OfflineFiles/chain_folder
- Download the chain files from the UCSC web site:
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz -O eQTLGenP2OfflineFiles/chain_folder/hg19ToHg38.over.chain.gz
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/liftOver/hg38ToHg19.over.chain.gz -O eQTLGenP2OfflineFiles/chain_folder/hg38ToHg19.over.chain.gz
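As a simple sanity check, you can verify that the downloaded chain files are intact gzip archives (gzip -t prints nothing if the files are fine):

gzip -t eQTLGenP2OfflineFiles/chain_folder/hg19ToHg38.over.chain.gz eQTLGenP2OfflineFiles/chain_folder/hg38ToHg19.over.chain.gz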
❗ If the genotype build of your input is hg38, you need to uncomment the corresponding variable and add the --chain_path argument to the Nextflow command in the template below (see the sketch that follows).
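Concretely, the two modifications look like this (a sketch; within the backslash-continued Nextflow command the extra line can go anywhere before the final options, e.g. after --reference_1000g_folder):

# Uncomment this line in the "Additional files for offline use" section:
chain_path=../../eQTLGenP2OfflineFiles/chain_folder

# Add this line to the Nextflow command:
--chain_path ${chain_path} \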
By default, the DataQC pipeline tries to download two PLINK executables and the 1000G reference data from the internet. If this download is not possible, you can instead adjust the script template and provide the corresponding files via additional arguments. These files are available in the folder you unpacked in the previous step.
In the following script template, the additional arguments are pre-filled.
#!/bin/bash
#SBATCH --time=48:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=6G
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --job-name="DataQc"
# These are the modules needed in UT HPC to get Singularity and Nextflow running. Replace them with the appropriate ones for your HPC.
module load java-1.8.0_40
module load singularity/3.5.3
module load squashfs/4.4
# If you follow the eQTLGen phase II cookbook and analysis folder structure,
# some of the following paths are pre-filled.
# https://github.com/eQTLGen/eQTLGen-phase-2-cookbook/wiki/eQTLGen-phase-II-cookbook
# We set the following variables for nextflow to prevent writing to your home directory (and potentially filling it completely)
# Feel free to change these as you wish.
export SINGULARITY_CACHEDIR=../../singularitycache
export NXF_HOME=../../nextflowcache
# Disable pathname expansion. Nextflow handles pathname expansion by itself.
set -f
# Define paths
nextflow_path=../../tools # folder where Nextflow executable is
# Genotype data
bfile_path=[full path to your input genotype files without .bed/.bim/.fam extension]
# Other data
exp_path=[full path to your gene expression matrix]
gte_path=[full path to your genotype-to-expression file]
exp_platform=[expression platform name: HT12v3/HT12v4/HuRef8/RNAseq/AffyU219/AffyHumanExon]
cohort_name=[name of the cohort]
genome_build="GRCh37"
output_path=../output # Output path
# Additional files for offline use
plink_executable=../../eQTLGenP2OfflineFiles/1_DataQC_additional_files/plink_executables/plink
plink2_executable=../../eQTLGenP2OfflineFiles/1_DataQC_additional_files/plink_executables/plink2
reference_1000g_folder=../../eQTLGenP2OfflineFiles/1_DataQC_additional_files/bigsnpr_reference/
# If your genotype build is hg38/GRCh38, uncomment this argument.
#chain_path=../../eQTLGenP2OfflineFiles/chain_folder
# Additional settings and optional arguments for the command
# --GenOutThresh [numeric threshold]
# --GenSdThresh [numeric threshold]
# --ExpSdThresh [numeric threshold]
# --ContaminationArea [number between 0 and 90, default 30]
# --InclusionList [file with the list of samples to restrict the analysis]
# --ExclusionList [file with the list of samples to remove from the analysis]
# --preselected_sex_check_vars "data/Affy6_pruned_chrX_variant_positions.txt"
# --AdditionalCovariates [file with additional covariates. First column should be `SampleID`]
# --gen_qc_steps 'WGS'
# --fam [PLINK .fam file. Takes precedence over .fam file supplied with `--bfile`]
# --plink_executable [path to plink executable (PLINK v1.90b6.26 64-bit)]
# --plink2_executable [path to plink2 executable (PLINK v2.00a3.7LM 64-bit Intel)]
# --reference_1000g_folder [path to folder with 1000G reference data]
# --chain_path [path to folder with hg19->hg38 and hg38->hg19 liftOver chain files]
# Command:
NXF_VER=21.10.6 ${nextflow_path}/nextflow run DataQC.nf \
--bfile ${bfile_path} \
--expfile ${exp_path} \
--gte ${gte_path} \
--exp_platform ${exp_platform} \
--cohort_name ${cohort_name} \
--genome_build ${genome_build} \
--outdir ${output_path} \
--plink_executable ${plink_executable} \
--plink2_executable ${plink2_executable} \
--reference_1000g_folder ${reference_1000g_folder} \
-profile slurm,singularity \
-resume
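Save the filled-in template as a script and submit it to SLURM in the usual way (the file name below is just an example):

sbatch submit_DataQC_pipeline.sh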