Requirements
Kmerworkflow is designed primarily for HPC distributed clusters, but a local, single-machine installation is also possible.
Install Kmerworkflow PyPI package
First, clone the repository and install the Kmerworkflow Python package with pip.
git clone https://github.com/thdurand4/KMER_WORKFLOW.git
cd KMER_WORKFLOW
python3 -m pip install .
Kmerworkflow --help
Next, follow the part of this documentation that matches your setup: local or HPC mode.
Steps for LOCAL installation
Install Kmerworkflow in local (single machine) mode using the Kmerworkflow install_local command.
install_local
Run the installation for use on a local computer.
The process downloads the Singularity images automatically.
Kmerworkflow install_local [OPTIONS]
Options
- --bash_completion, --no-bash_completion
Allow bash completion of Kmerworkflow commands on the bashrc file
- Default:
True
To create a pipeline, the tools used by Kmerworkflow are wrapped into Singularity images. These images are automatically downloaded and used by the configuration files of the pipeline. A local-mode installation, without a scheduler, is constrained to use these Singularity images.
The script uses the snakemake profiles to build the installation profile for Kmerworkflow. If --env is singularity, Kmerworkflow downloads the images. The script then proposes modifying the configuration files to fit your system architecture.
Optionally (but recommended), after a local installation, you can check the Kmerworkflow installation using a dataset scaled for a single machine. See the section Check install for details.
Steps for HPC distributed cluster installation
Kmerworkflow uses any available snakemake profiles to ease cluster installation and resource management. Run the command Kmerworkflow install_cluster to install on an HPC cluster. We tried to make cluster installation as easy as possible, but you still need to adapt a few files to your cluster environment.
install_cluster
Run the installation of the tool for an HPC cluster
Kmerworkflow install_cluster [OPTIONS]
Options
- -s, --scheduler <scheduler>
Type of HPC scheduler (for the moment, only slurm is available)
- Default:
slurm
- Options:
slurm
- -e, --env <env>
Mode for tools dependencies
- Default:
modules
- Options:
modules | singularity
- --bash_completion, --no-bash_completion
Allow bash completion of Kmerworkflow commands on the bashrc file
- Default:
True
1. Adapt profile and cluster_config.yaml
Now that Kmerworkflow is installed, it proposes default configuration files, but they can be modified. Please check and adapt these files to your own system architecture.
1. Adapt the pre-formatted snakemake profile to configure your cluster options. See the section 1. Snakemake profiles for details.
2. Adapt the cluster_config.yaml file to manage cluster resources such as partition, memory and threads available for each job.
See the section 2. Adapting cluster_config.yaml for further details.
2. Adapt tools_path.yaml
As Kmerworkflow uses many tools, you must install them using one of the two following possibilities:
Kmerworkflow install_cluster --help
Kmerworkflow install_cluster --scheduler slurm --env modules
# OR
Kmerworkflow install_cluster --scheduler slurm --env singularity
If the --env singularity argument is specified, Kmerworkflow will download previously built Singularity images containing the complete environment needed to run Kmerworkflow (tools and dependencies).
Adapt the tools_path.yaml file, written in YAML (YAML Ain't Markup Language) format, to tell Kmerworkflow where the different tools are installed on your cluster.
See the section 3. How to configure tools_path.yaml for details.
Check install
To test your Kmerworkflow installation, a test dataset called data_test_Kmerworkflow/ is available at IN PROGRESS.
test_install
The test_install function downloads a scaled test dataset, writes a configuration file adapted to it, and proposes a command line that is ready to run.
Kmerworkflow test_install [OPTIONS]
Options
- -d, --data_dir <data_dir>
Required. Path where the test data is downloaded and where config.yaml is created to run the test
This dataset will be automatically downloaded by Kmerworkflow into the -d directory using:
Kmerworkflow test_install -d test
Launching the (suggested, to be adapted) command line in CLUSTER mode will perform the tests:
Kmerworkflow run_cluster --config test/data_test_config.yaml
In local mode, type:
Kmerworkflow run_local -t 8 -c test/data_test_config.yaml --singularity-args "--bind $HOME"
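The two test commands above differ only in the sub-command and its options. As a minimal sketch, assuming you want to launch them from a Python script, the helper below builds the argument list for either mode. The function name build_test_command and its parameters are hypothetical and not part of Kmerworkflow; only the sub-commands and flags shown above are taken from this documentation.

```python
import os
import shlex


def build_test_command(mode, config, threads=8, bind_dir=None):
    """Return the Kmerworkflow test command as an argument list.

    Hypothetical helper: it only assembles the command lines documented
    above and does not run anything.
    """
    if mode == "cluster":
        return ["Kmerworkflow", "run_cluster", "--config", config]
    if mode == "local":
        if bind_dir is None:
            # The documented command binds $HOME; expand it here because
            # no shell will do it when the list is passed to subprocess.
            bind_dir = os.path.expanduser("~")
        return [
            "Kmerworkflow", "run_local",
            "-t", str(threads),
            "-c", config,
            "--singularity-args", f"--bind {bind_dir}",
        ]
    raise ValueError(f"unknown mode: {mode!r}")


# Print both command lines, quoted so they can be pasted into a shell.
for mode in ("cluster", "local"):
    print(shlex.join(build_test_command(mode, "test/data_test_config.yaml")))
```

The returned list can be passed to subprocess.run once the suggested command has been adapted to your system.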
Advanced installation
1. Snakemake profiles
The Snakemake-profiles project is an open effort to create configuration profiles for executing Snakemake in various computing environments (job scheduling systems such as Slurm, SGE, Grid middleware, or cloud computing). It is available at https://github.com/Snakemake-Profiles/doc.
To run Kmerworkflow on an HPC cluster, we take advantage of profiles.
As a quick reference, see here an example of the Snakemake SLURM profile we used for the meso cluster.
More info about profiles can be found here https://github.com/Snakemake-Profiles/slurm#quickstart.
Preparing the profile’s config.yaml file
Once your basic profile is created, finalize it by modifying KMER_WORKFLOW/Kmerworkflow/default_profile/config.yaml as necessary to customize the Snakemake parameters that Kmerworkflow uses internally:
restart-times: 0
jobscript: "slurm-jobscript.sh"
cluster: "slurm-submit.py"
cluster-status: "slurm-status.py"
max-jobs-per-second: 1
max-status-checks-per-second: 10
local-cores: 1
jobs: 200 # edit to limit the number of jobs submitted in parallel
latency-wait: 60000000
use-envmodules: true # set true/false to choose between envmodules and singularity, but activate only one of the two!
use-singularity: false # if False, please install all R packages listed in tools_config.yaml ENVMODULE/R
rerun-incomplete: true
printshellcmds: true
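The comments in the profile ask for exactly one of use-envmodules and use-singularity to be active. The sketch below checks that constraint before a run. It is an illustration only: parse_flat_profile and check_env_mode are hypothetical names, and the small parser handles only flat key: value lines like the profile above, not general YAML.

```python
def parse_flat_profile(text):
    """Parse flat `key: value` lines, skipping blanks and comments.

    Assumption: the profile is flat, as in the example above. This is
    not a general YAML parser.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop a trailing comment
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        value = value.strip().strip('"').strip("'")
        if value.lower() in ("true", "false"):
            settings[key.strip()] = value.lower() == "true"
        else:
            settings[key.strip()] = value
    return settings


def check_env_mode(settings):
    """Return 'envmodules' or 'singularity'; raise if both or neither is on."""
    modules = settings.get("use-envmodules", False) is True
    singularity = settings.get("use-singularity", False) is True
    if modules == singularity:
        raise ValueError(
            "activate exactly one of use-envmodules / use-singularity"
        )
    return "envmodules" if modules else "singularity"


PROFILE = """\
jobs: 200 # edit to limit the number of jobs submitted in parallel
use-envmodules: true
use-singularity: false
"""
print(check_env_mode(parse_flat_profile(PROFILE)))  # prints: envmodules
```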
2. Adapting cluster_config.yaml
In the cluster_config.yaml file, you can manage HPC resources, choosing partition, memory and threads to be used by default,
or specifically, for each rule/tool depending on your HPC Job Scheduler (see there). This file generally belongs to a Snakemake profile.
Example of cluster_config_slurm.yaml:
__default__:
    cpus-per-task: 4
    mem-per-cpu: 10G
    partition: agap_short
    output: '{log.output}_cluster'
    error: '{log.error}_cluster'
    job-name: "{rule}.{wildcards}"
check_nb_reads:
    cpus-per-task: 1
    partition: agap_short
    mem-per-cpu: 10G
sub_set:
    cpus-per-task: 3
    partition: agap_short
    mem-per-cpu: 10G
create_list_fastq:
    cpus-per-task: 1
    partition: agap_short
    mem-per-cpu: 1G
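In a Snakemake cluster configuration, the __default__ entry is the fallback for every rule, and an entry named after a rule overrides it key by key. The sketch below reproduces that lookup on a subset of the example above. The function names are hypothetical, and the conversion to sbatch options is only an assumption about how a submit script could use the values; the actual slurm-submit.py of your profile may proceed differently.

```python
# Subset of the example cluster_config_slurm.yaml, written as a dict.
CLUSTER_CONFIG = {
    "__default__": {
        "cpus-per-task": 4,
        "mem-per-cpu": "10G",
        "partition": "agap_short",
    },
    "sub_set": {
        "cpus-per-task": 3,
        "partition": "agap_short",
        "mem-per-cpu": "10G",
    },
    "create_list_fastq": {
        "cpus-per-task": 1,
        "partition": "agap_short",
        "mem-per-cpu": "1G",
    },
}


def resolve_resources(cluster_config, rule):
    """Start from __default__, then apply the rule-specific overrides."""
    resolved = dict(cluster_config.get("__default__", {}))
    resolved.update(cluster_config.get(rule, {}))
    return resolved


def to_sbatch_options(resources):
    """Format resolved resources as sbatch long options (illustrative)."""
    return [f"--{key}={value}" for key, value in resources.items()]


# A rule without its own entry falls back to the defaults.
print(resolve_resources(CLUSTER_CONFIG, "kmer_count"))
# A rule with its own entry overrides the matching keys.
print(to_sbatch_options(resolve_resources(CLUSTER_CONFIG, "create_list_fastq")))
```

Here kmer_count receives 4 CPUs and 10G per CPU from __default__, while create_list_fastq is limited to 1 CPU and 1G per CPU.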
Warning
If more memory or threads are requested, please adapt the content of this file before running on your cluster.
A list of Kmerworkflow rule names can be found in the section Threading rules inside Kmerworkflow.
Warning
For some rules in cluster_config.yaml, such as rule_graph or run_get_versions, we use wildcards by default; please do not remove them.
3. How to configure tools_path.yaml
In the tools_path file, you can find two sections: SINGULARITY and ENVMODULE. To fill it in correctly, you have two options:
1. Use only SINGULARITY containers: in this case, fill in only this section. Put the path to the built Singularity images you want to use. Absolute paths are strongly recommended. See the section ‘How to build singularity images’ for further details.
SINGULARITY:
    TOOLS: 'docker://thdurand/kmer_workflow:1.0' # DO NOT CHANGE THIS LINE IF YOU USE SINGULARITY!
    PERL: 'docker://thdurand/perl:1.0' # DO NOT CHANGE THIS LINE IF YOU USE SINGULARITY!
Warning
To ensure that the SINGULARITY containers are actually used, make sure that the --use-singularity flag is included in the snakemake command line.
2. Use only ENVMODULE: in this case, fill in this section with the modules available on your cluster (here is an example):
# This is an example of tools paths
ENVMODULE:
    PYTHON: "python/3.8.2" # Works with Python version >= 3.8.2
    SEQTK: "seqtk" # Works with seqtk version >= 1.3-r106
    KAT: "kat" # Works with kat version >= 2.4.2
    JELLYFISH: "jellyfish/2.3.0" # Works with jellyfish version >= 2.3.0
    PERL: "perllib/5.16.3" # Works with perl libraries GD::Simple - GD::SVG - Data::Dumper - Getopt::Long
Warning
Make sure to specify the --use-envmodules flag in the snakemake command line for ENVMODULE to be used. More details can be found here: https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#using-environment-modules
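Before launching a run, it can help to confirm that the section you rely on in tools_path.yaml is fully filled in. The following is a minimal sketch under two assumptions: the file has already been loaded into a Python dict (for example with a YAML library), and the expected keys are exactly those shown in the two examples above. EXPECTED_KEYS and missing_tools are hypothetical names, not part of Kmerworkflow.

```python
# Keys taken from the SINGULARITY and ENVMODULE examples in this section.
EXPECTED_KEYS = {
    "SINGULARITY": ("TOOLS", "PERL"),
    "ENVMODULE": ("PYTHON", "SEQTK", "KAT", "JELLYFISH", "PERL"),
}


def missing_tools(tools_path, section):
    """Return the expected keys of `section` that are absent or empty."""
    if section not in EXPECTED_KEYS:
        raise ValueError(f"unknown section: {section!r}")
    entries = tools_path.get(section) or {}
    return [key for key in EXPECTED_KEYS[section] if not entries.get(key)]


# Example: JELLYFISH was left empty in the ENVMODULE section.
tools_path = {
    "ENVMODULE": {
        "PYTHON": "python/3.8.2",
        "SEQTK": "seqtk",
        "KAT": "kat",
        "JELLYFISH": "",
        "PERL": "perllib/5.16.3",
    }
}
print(missing_tools(tools_path, "ENVMODULE"))    # prints: ['JELLYFISH']
print(missing_tools(tools_path, "SINGULARITY"))  # prints: ['TOOLS', 'PERL']
```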
Threading rules inside Kmerworkflow
Here are the rule names found in the Kmerworkflow code. It is recommended to set threads using the snakemake command when running on a single machine, or in a cluster configuration file to manage cluster resources through the job scheduler. This saves users a painful exploration of the Kmerworkflow snakefiles.
check_nb_reads
sub_set
create_list_fastq
cat_fastq
kmer_count
binary_to_tbl
cut_coverage
regroup_kmer
sorted_kmer
split_kmer_by_line
sub_table
merge_split
merge_final
intersection_table
upset_plot
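The rule names above can be used to draft per-rule entries for cluster_config.yaml. The sketch below generates such a skeleton as text, using the same keys as the example in section 2. The function name and the default values (1 CPU, 1G, partition agap_short) are placeholders chosen for illustration; adapt them to your cluster before use.

```python
RULES = [
    "check_nb_reads", "sub_set", "create_list_fastq", "cat_fastq",
    "kmer_count", "binary_to_tbl", "cut_coverage", "regroup_kmer",
    "sorted_kmer", "split_kmer_by_line", "sub_table", "merge_split",
    "merge_final", "intersection_table", "upset_plot",
]


def cluster_config_skeleton(rules, cpus=1, mem="1G", partition="agap_short"):
    """Return YAML text with one resource block per rule.

    The values are placeholders; edit them per rule afterwards.
    """
    lines = []
    for rule in rules:
        lines.append(f"{rule}:")
        lines.append(f"    cpus-per-task: {cpus}")
        lines.append(f"    partition: {partition}")
        lines.append(f"    mem-per-cpu: {mem}")
    return "\n".join(lines) + "\n"


# Draft entries for two rules with more generous resources.
print(cluster_config_skeleton(["kmer_count", "sorted_kmer"], cpus=8, mem="4G"))
```

The generated text can be pasted below the __default__ entry and then tuned rule by rule.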