How to create a workflow
Kmerworkflow lets you build a workflow from a single config.yaml configuration file:
First, provide the data paths.
Second, set the tool parameters.
Last, name the output tables and choose the optional rules.
To create this file, just run:
create_config
Create config.yaml for run
Kmerworkflow create_config [OPTIONS]
Options
- -c, --configyaml <configyaml>
Required Path to create config.yaml
Then, edit the relevant sections of the file to customize your flavor of a workflow.
1. Providing data
First, indicate the data path in the config.yaml configuration file:
DATA:
############################################################################
#Paths of the accession files, of the scripts, and of the directory where results will be written
    OUTPUT_DIR: "/home/durandt/scratch/WORKFLOW_KMER/WORKFLOW_KMER_V2/RESULTS_FASTQ_UNCONTAMINATED/" #Directory where all the results will be written
    LIST_ACCESSION: "/home/durandt/scratch/WORKFLOW_KMER/WORKFLOW_KMER_V2/list_fastq_uncontaminate.txt" #File needed to launch the workflow; its lines look like "Accession Path/to/fastq.gz ... black"
    LIST_COULEURS: "/home/durandt/scratch/WORKFLOW_KMER/WORKFLOW_KMER_V2/list_couleur" #Path of the file used to color individuals in the graph. This file is created automatically
The table below describes each input needed to launch Kmerworkflow:
| Input | Description |
|---|---|
| LIST_ACCESSION | A tab-separated text file written by the user. It holds information about the individuals and their FASTQ files. See below. |
| LIST_COULEURS | Path to the file that supplies the colors for the final graph of the pipeline. THIS FILE IS CREATED AUTOMATICALLY; DO NOT CREATE IT. |
| OUTPUT_DIR | Path to the output directory. |
Example of “LIST_ACCESSION” file:
Warning
MAKE SURE TO HAVE ONE LINE PER FASTQ FILE. Separate the fields of this file with tabs:
- The first field is the name of the individual.
- The second field is the path to your FASTQ file.
- The third field is the number of reads to subsample. IF YOU HAVE PAIRED DATA AND WANT TO SUBSAMPLE 20 MILLION READS, WRITE 10 MILLION FOR EACH FILE OF THE PAIR.
- The fourth field is the seed for random subsampling. MAKE SURE TO USE THE SAME SEED FOR ALL _R1.fq FILES AND ANOTHER SEED FOR ALL _R2.fq FILES, FOR EXAMPLE 100 FOR R1 AND 150 FOR R2.
- The last field is the color assigned to the individual (shown on the final graph).
Erianthus_arundinaceus_EA001:JGI:Guangxi /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_arundinaceus_EA001_Resequencing_P1.fastq.gz 10000 100 black
Erianthus_arundinaceus_EA001:JGI:Guangxi /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_arundinaceus_EA001_Resequencing_P2.fastq.gz 10000 150 black
Erianthus_arundinaceus_IK76-48:JGI:SRA /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_arundinaceus_IK_76-48_Resequencing_P1.fastq.gz 10000 100 black
Erianthus_arundinaceus_IK76-48:JGI:SRA /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_arundinaceus_IK_76-48_Resequencing_P2.fastq.gz 10000 150 black
Erianthus_fulvus_EF001:JGI:Guangxi /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_fulvus_EF001_Resequencing_P1.fastq.gz 10000 100 black
Erianthus_fulvus_EF001:JGI:Guangxi /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_fulvus_EF001_Resequencing_P2.fastq.gz 10000 150 black
Miscanthus_Floridus_MiscFlo-PI295762:JGI:Houma /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_Floridus_MiscFlo-PI295762_Resequencing_P1.fastq.gz 10000 100 black
Miscanthus_Floridus_MiscFlo-PI295762:JGI:Houma /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_Floridus_MiscFlo-PI295762_Resequencing_P2.fastq.gz 10000 150 black
Miscanthus_sinense_JW484:JGI:JIRCAS /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_sinense_JW484_Resequencing_P1.fastq.gz 10000 100 black
Miscanthus_sinense_JW484:JGI:JIRCAS /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_sinense_JW484_Resequencing_P2.fastq.gz 10000 150 black
Miscanthus_sinense_NG7722:JGI:CIRAD /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_sinense_NG7722_Resequencing_P1.fastq.gz 10000 100 black
Miscanthus_sinense_NG7722:JGI:CIRAD /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_sinense_NG7722_Resequencing_P2.fastq.gz 10000 150 black
Narenga_porphyrocoma_Narenga:JGI:JIRCAS /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Narenga_porphyrocoma_Narenga_Resequencing_P1.fastq.gz 10000 100 black
Narenga_porphyrocoma_Narenga:JGI:JIRCAS /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Narenga_porphyrocoma_Narenga_Resequencing_P2.fastq.gz 10000 150 black
Narenga_sp_N001:JGI:Guangxi /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Narenga_sp_N001_Resequencing_P1.fastq.gz 10000 100 black
Narenga_sp_N001:JGI:Guangxi /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Narenga_sp_N001_Resequencing_P2.fastq.gz 10000 150 black
Saccharum_barberi_Chunnee:JGI:eRcane /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Saccharum_barberi_Chunnee_Resequencing_P1.fastq.gz 10000 100 brown
Saccharum_barberi_Chunnee:JGI:eRcane /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Saccharum_barberi_Chunnee_Resequencing_P2.fastq.gz 10000 150 brown
Saccharum_barberi_GANAPATHY:JGI:Miami /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Saccharum_barberi_GANAPATHY_Resequencing_P1.fastq.gz 10000 100 brown
Saccharum_barberi_GANAPATHY:JGI:Miami /storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Saccharum_barberi_GANAPATHY_Resequencing_P2.fastq.gz 10000 150 brown
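The field rules for LIST_ACCESSION can be checked before launching a run. The following is a minimal Python sketch and is not part of Kmerworkflow: the function name `check_accession_lines` is hypothetical, and detecting the mate from a `_R1`/`_R2` or `_P1`/`_P2` suffix in the file name is an assumption based on the examples on this page.

```python
import re
from collections import defaultdict

# Hypothetical helper, not part of Kmerworkflow.
# Assumption: the mate number is readable from a _R1/_R2 or _P1/_P2 suffix.
MATE_PATTERN = re.compile(r"_[RP]([12])\.(?:fastq|fq)(?:\.gz)?$")


def check_accession_lines(lines):
    """Return a list of problems found in LIST_ACCESSION lines (empty if none)."""
    problems = []
    seeds_by_mate = defaultdict(set)  # mate number ("1" or "2") -> seeds seen
    seen_paths = set()
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            problems.append(
                f"line {number}: expected 5 tab-separated fields, got {len(fields)}"
            )
            continue
        name, path, n_reads, seed, color = fields
        if not n_reads.isdigit() or not seed.isdigit():
            problems.append(f"line {number}: read count and seed must be integers")
            continue
        if path in seen_paths:
            problems.append(f"line {number}: FASTQ file listed more than once")
        seen_paths.add(path)
        mate = MATE_PATTERN.search(path)
        if mate:
            seeds_by_mate[mate.group(1)].add(int(seed))
    # Each mate should use a single seed, and the two mates different seeds.
    for mate, seeds in sorted(seeds_by_mate.items()):
        if len(seeds) > 1:
            problems.append(f"mate {mate}: several seeds used {sorted(seeds)}")
    if len(seeds_by_mate) == 2 and seeds_by_mate["1"] & seeds_by_mate["2"]:
        problems.append("mates 1 and 2 share a seed")
    return problems
```

A possible use is `check_accession_lines(open("list_fastq.txt"))`, where the file name is a placeholder; an empty list means no problem was detected by these particular checks.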
Warning
For FASTQ files, prefer a naming convention such as NAME_R1.fastq.gz, NAME_R1.fq.gz, NAME_R1.fastq or NAME_R1.fq (and likewise for _R2). Prefer short names and avoid special characters, because the report can fail. Avoid the long names assigned directly by the sequencer. All FASTQ files must share the same extension and may be compressed or not. Before launching the pipeline, it is also advisable to check that your data is free of contamination.
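The requirement that all FASTQ files share one extension is easy to verify up front. Here is a small Python sketch of such a check; the helper names `fastq_suffix` and `extensions_are_homogeneous` are hypothetical and do not exist in Kmerworkflow.

```python
from pathlib import PurePosixPath

# The four extensions named in the warning above; longer ones are tested first.
ALLOWED_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def fastq_suffix(path):
    """Return the FASTQ extension of `path`, or None if it is not an allowed one."""
    name = PurePosixPath(path).name
    for suffix in ALLOWED_SUFFIXES:
        if name.endswith(suffix):
            return suffix
    return None


def extensions_are_homogeneous(paths):
    """True if every path has an allowed extension and all share the same one."""
    suffixes = {fastq_suffix(p) for p in paths}
    return None not in suffixes and len(suffixes) == 1
```

For example, a list mixing `a_R1.fq.gz` and `a_R2.fastq` would be reported as not homogeneous.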
2. Parameters for some specific tools
TOOLS_PARAMS:
##########################################################################
#Choose the parameters of the tools
    KAT_HIST: "-t 4 --dump_hash --mer_len 50" #KAT PARAMETERS
    JELLYFISH_DUMP: "-c -t" #JELLYFISH PARAMETERS Don't change the "-c"
    CUT_COVERAGE: "10" #10 is the default value: only k-mers seen at least 10 times are kept
    INTERSECT_TABLE: "--start 1 --end 400" #See the count_intersection.py script for all possible parameters
    FULL_TABLE: "no" #yes OR no: if yes, give a name in OPTIONAL FULL_INTERSECT_NAME, else do nothing
The table below describes each parameter of Kmerworkflow:
| Params | Description |
|---|---|
| KAT_HIST | Parameters passed to the KAT tool. |
| JELLYFISH_DUMP | Parameters passed to the Jellyfish tool. Do not change the "-c". |
| CUT_COVERAGE | The coverage cutoff. With a value of 10, the pipeline works only on k-mers seen at least 10 times. |
| INTERSECT_TABLE | Parameters of the count_intersection.py script. See ‘Kmerworkflow/snakemake_scripts/count_intersection.py’ for the available parameters. |
| FULL_TABLE | A parameter of the count_intersection.py script: write yes or no. |
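To make the CUT_COVERAGE cutoff concrete, here is a minimal Python sketch of the idea. It is not the pipeline's implementation: it assumes the k-mer counts are available as two-column text lines (k-mer, then count, separated by a tab or spaces), and the function name `filter_by_coverage` is hypothetical.

```python
def filter_by_coverage(lines, cutoff=10):
    """Yield (kmer, count) pairs whose count is at least `cutoff`.

    Assumption: each line holds a k-mer and its count, whitespace-separated.
    """
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        kmer, count = parts[0], int(parts[1])
        if count >= cutoff:
            yield kmer, count
```

With the default cutoff of 10, a k-mer counted 10 times is kept and one counted 9 times is dropped, which matches the "seen at least 10 times" wording.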
3. Optional rules and names of the output tables
Activate or deactivate tools as you wish, and name the output tables of the pipeline.
Example:
OPTIONAL:
############################################################################
#Choose the names of the final tables
    FULL_INTERSECT_NAME: "fastq_uncontaminated_all" #Give the name of the full intersection table
    PARSED_INTERSECT_NAME: "graph_1_400_fastq_uncontaminated" #Give the name of the table which will be used to make the graph
    UPSET_PLOT: "graph_kmer_uncontaminated" #Give the name of the graph (upset-plot)
    CHECK_NB_READS: False
Warning
Please check the documentation of each tool (outside of Kmerworkflow) and make sure that the settings are correct!
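Once the three sections are filled in, a quick consistency check can catch omissions before a run. The sketch below is an illustration only and is not shipped with Kmerworkflow. It works on the configuration as a Python dict (for instance the result of loading config.yaml with a YAML parser), the function name `check_config` is hypothetical, and treating every key shown on this page as required is an assumption.

```python
# Keys copied from the DATA, TOOLS_PARAMS and OPTIONAL examples on this page.
# Assumption: all of them are expected to be present.
EXPECTED_KEYS = {
    "DATA": ["OUTPUT_DIR", "LIST_ACCESSION", "LIST_COULEURS"],
    "TOOLS_PARAMS": [
        "KAT_HIST", "JELLYFISH_DUMP", "CUT_COVERAGE",
        "INTERSECT_TABLE", "FULL_TABLE",
    ],
    "OPTIONAL": [
        "FULL_INTERSECT_NAME", "PARSED_INTERSECT_NAME",
        "UPSET_PLOT", "CHECK_NB_READS",
    ],
}


def check_config(config):
    """Return a list of problems found in an already-loaded config dict."""
    problems = []
    for section, keys in EXPECTED_KEYS.items():
        content = config.get(section) or {}
        for key in keys:
            if key not in content:
                problems.append(f"{section}.{key} is missing")
    tools = config.get("TOOLS_PARAMS") or {}
    optional = config.get("OPTIONAL") or {}
    # FULL_TABLE "yes" needs a table name in OPTIONAL.FULL_INTERSECT_NAME.
    full_table = str(tools.get("FULL_TABLE", "no")).lower()
    if full_table in ("yes", "true") and not optional.get("FULL_INTERSECT_NAME"):
        problems.append(
            "OPTIONAL.FULL_INTERSECT_NAME must be set when FULL_TABLE is yes"
        )
    # The config comment asks to keep "-c" in JELLYFISH_DUMP.
    if "JELLYFISH_DUMP" in tools and "-c" not in str(tools["JELLYFISH_DUMP"]).split():
        problems.append("TOOLS_PARAMS.JELLYFISH_DUMP should keep the -c option")
    return problems
```

An empty list means these specific checks passed; it says nothing about whether the tool options themselves are valid.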
How to run the workflow
Before attempting to run Kmerworkflow, please verify that you have already modified the config.yaml file as explained in 1. Providing data.
If you installed Kmerworkflow on an HPC cluster with a job scheduler, you can run:
run_cluster
Kmerworkflow run_cluster [OPTIONS] [SNAKEMAKE_OTHER]...
Options
- -c, --config <config>
Required Configuration file for run tool
- -pdf, --pdf
Run snakemake with --dag, --rulegraph and --filegraph
- Default:
False
Arguments
- SNAKEMAKE_OTHER
Optional argument(s)
Warning
MAKE SURE TO RUN THE WORKFLOW WITH THIS COMMAND: Kmerworkflow run_cluster -c config.yaml --rerun-triggers mtime
Warning
IF YOU CHOSE TO INSTALL WITH SINGULARITY, RUN THIS COMMAND: Kmerworkflow run_cluster -c config.yaml --rerun-triggers mtime --singularity-args '--home ~/'
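If you launch runs from a script, the two commands in the warnings above can be assembled programmatically. The following Python sketch only builds the argument list from the options quoted on this page; the function name `build_run_cluster_command` is hypothetical, and nothing here is provided by Kmerworkflow itself.

```python
import shlex


def build_run_cluster_command(config_path, singularity=False):
    """Build the run_cluster command line as a list of arguments."""
    command = [
        "Kmerworkflow", "run_cluster",
        "-c", config_path,
        "--rerun-triggers", "mtime",
    ]
    if singularity:
        # Extra argument recommended above for Singularity installs.
        command += ["--singularity-args", "--home ~/"]
    return command


# Show the command as it would be typed in a shell.
print(shlex.join(build_run_cluster_command("config.yaml", singularity=True)))
```

The returned list can be passed to `subprocess.run(command, check=True)`, which avoids quoting issues with the nested `--home ~/` argument.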
run_local
Kmerworkflow run_local [OPTIONS] [SNAKEMAKE_OTHER]...
Options
- -c, --config <config>
Required Configuration file for run tool
- -t, --threads <threads>
Required Number of threads
- -p, --pdf
Run snakemake with --dag, --rulegraph and --filegraph
Arguments
- SNAKEMAKE_OTHER
Optional argument(s)
Advanced run
Providing more resources
If the cluster default resources are not sufficient, you can edit the cluster_config.yaml file. See 2. Adapting cluster_config.yaml:
edit_cluster_config
Edit the cluster_config.yaml used by the profile
Kmerworkflow edit_cluster_config [OPTIONS]
Providing your own tools_config.yaml
To change the tools used in a Kmerworkflow workflow, see 3. How to configure tools_path.yaml:
edit_tools
Edit your own tool versions
Kmerworkflow edit_tools [OPTIONS]
Options
- -r, --restore
Restore default tools_config.yaml (from install)
- Default:
False
Output of Kmerworkflow
The Kmerworkflow output directory is organized as follows:
OUTPUT_Kmerworkflow/
├── 1_BIS_SUB_SET_READS
├── 1_MERGED_FASTQ
├── 2_KMER_COUNT
├── 3_MERGE_KMER
├── 4_SPLIT_KMER
├── 5_MERGE_TABLE
├── 6_INTERSECTION_TABLE
├── 7_UPSET_PLOT
└── LOGS
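After a run, you may want to confirm that the layout above was produced. Here is a short Python sketch that lists which of these directories are absent; the function name `missing_output_dirs` is hypothetical, and expecting every directory to exist after every run is an assumption (some steps may depend on the optional rules you enabled).

```python
from pathlib import Path

# Directory names copied from the output tree shown above.
EXPECTED_DIRS = [
    "1_BIS_SUB_SET_READS",
    "1_MERGED_FASTQ",
    "2_KMER_COUNT",
    "3_MERGE_KMER",
    "4_SPLIT_KMER",
    "5_MERGE_TABLE",
    "6_INTERSECTION_TABLE",
    "7_UPSET_PLOT",
    "LOGS",
]


def missing_output_dirs(output_dir):
    """Return the expected sub-directories that are absent from `output_dir`."""
    root = Path(output_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling it with the OUTPUT_DIR value from config.yaml returns an empty list when all nine directories are present.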