How to create a workflow 

Kmerworkflow lets you build a workflow from a single config.yaml configuration file:

First, provide the data paths.
Second, set the tool parameters.
Last, name the output tables and choose the optional rules.

To create this file, just run:

create_config 

Create config.yaml for run

Kmerworkflow create_config [OPTIONS]

Options

-c, --configyaml <configyaml>: Required Path to create config.yaml

Then, edit the relevant sections of the file to customize your workflow.

1. Providing data 

First, indicate the data path in the config.yaml configuration file:

DATA:
    ############################################################################
    # Paths to the accession file, the scripts, and the directory where results will be written
    OUTPUT_DIR: "/home/durandt/scratch/WORKFLOW_KMER/WORKFLOW_KMER_V2/RESULTS_FASTQ_UNCONTAMINATED/"  # Directory where all the results will be written
    LIST_ACCESSION: "/home/durandt/scratch/WORKFLOW_KMER/WORKFLOW_KMER_V2/list_fastq_uncontaminate.txt"  # File needed to launch the workflow; it contains tab-separated lines such as "Accession Path/to/fastq.gz 10000 100 black"
    LIST_COULEURS: "/home/durandt/scratch/WORKFLOW_KMER/WORKFLOW_KMER_V2/list_couleur"  # Path of the file used to color the individuals in the graph; this file is created automatically
    

The table below describes each input needed to launch Kmerworkflow:

Input	Description
LIST_ACCESSION	Tab-separated text file written by the user. It describes the individuals and their FASTQ files; see below.
LIST_COULEURS	Path to the file that provides the colors for the final graph of the pipeline. This file is created automatically; do not create it yourself.
OUTPUT_DIR	Path to the output directory.
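
As an illustration, the DATA entries can be checked before launching anything. The Python sketch below is not part of Kmerworkflow: the function name `check_data_section` and its messages are hypothetical, and it assumes the DATA section has already been loaded into a plain dict.

```python
from pathlib import Path


def check_data_section(data):
    """Return a list of problems found in the DATA section (empty if none).

    `data` is assumed to be the DATA mapping of config.yaml as a plain dict.
    Hypothetical helper, not part of Kmerworkflow.
    """
    problems = []
    # The three keys described in the table above must all be filled in.
    for key in ("OUTPUT_DIR", "LIST_ACCESSION", "LIST_COULEURS"):
        if not data.get(key):
            problems.append(f"{key} is missing or empty")
    # LIST_ACCESSION is written by the user, so it must exist beforehand.
    accession = data.get("LIST_ACCESSION")
    if accession and not Path(accession).is_file():
        problems.append(f"LIST_ACCESSION not found: {accession}")
    return problems


example = {
    "OUTPUT_DIR": "results/",
    "LIST_ACCESSION": "list_fastq.txt",  # placeholder path for the example
    "LIST_COULEURS": "list_couleur",
}
for problem in check_data_section(example):
    print(problem)
```

LIST_COULEURS is only checked for being set, since the file itself is created by the pipeline.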

Example of a “LIST_ACCESSION” file:

Warning

Make sure to have one line per FASTQ file, with fields separated by tabs:
- The first field is the name of the individual.
- The second field is the path to your FASTQ file.
- The third field is the number of reads to subsample. If you have paired data and want to subsample 20 million reads in total, write 10 million for each file of the pair.
- The fourth field is the seed for random subsampling. Use the same seed for all _R1.fq files and another seed for all _R2.fq files, for example 100 for R1 and 150 for R2.
- The last field is the color assigned to the individual (it appears on the final graph).

Erianthus_arundinaceus_EA001:JGI:Guangxi	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_arundinaceus_EA001_Resequencing_P1.fastq.gz	10000	100	black
Erianthus_arundinaceus_EA001:JGI:Guangxi	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_arundinaceus_EA001_Resequencing_P2.fastq.gz	10000	150	black
Erianthus_arundinaceus_IK76-48:JGI:SRA	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_arundinaceus_IK_76-48_Resequencing_P1.fastq.gz	10000	100	black
Erianthus_arundinaceus_IK76-48:JGI:SRA	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_arundinaceus_IK_76-48_Resequencing_P2.fastq.gz	10000	150	black
Erianthus_fulvus_EF001:JGI:Guangxi	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_fulvus_EF001_Resequencing_P1.fastq.gz	10000	100	black
Erianthus_fulvus_EF001:JGI:Guangxi	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_fulvus_EF001_Resequencing_P2.fastq.gz	10000	150	black
Miscanthus_Floridus_MiscFlo-PI295762:JGI:Houma	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_Floridus_MiscFlo-PI295762_Resequencing_P1.fastq.gz	10000	100	black
Miscanthus_Floridus_MiscFlo-PI295762:JGI:Houma	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_Floridus_MiscFlo-PI295762_Resequencing_P2.fastq.gz	10000	150	black
Miscanthus_sinense_JW484:JGI:JIRCAS	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_sinense_JW484_Resequencing_P1.fastq.gz	10000	100	black
Miscanthus_sinense_JW484:JGI:JIRCAS	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_sinense_JW484_Resequencing_P2.fastq.gz	10000	150	black
Miscanthus_sinense_NG7722:JGI:CIRAD	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_sinense_NG7722_Resequencing_P1.fastq.gz	10000	100	black
Miscanthus_sinense_NG7722:JGI:CIRAD	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_sinense_NG7722_Resequencing_P2.fastq.gz	10000	150	black
Narenga_porphyrocoma_Narenga:JGI:JIRCAS	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Narenga_porphyrocoma_Narenga_Resequencing_P1.fastq.gz	10000	100	black
Narenga_porphyrocoma_Narenga:JGI:JIRCAS	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Narenga_porphyrocoma_Narenga_Resequencing_P2.fastq.gz	10000	150	black
Narenga_sp_N001:JGI:Guangxi	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Narenga_sp_N001_Resequencing_P1.fastq.gz	10000	100	black
Narenga_sp_N001:JGI:Guangxi	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Narenga_sp_N001_Resequencing_P2.fastq.gz	10000	150	black
Saccharum_barberi_Chunnee:JGI:eRcane	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Saccharum_barberi_Chunnee_Resequencing_P1.fastq.gz	10000	100	brown
Saccharum_barberi_Chunnee:JGI:eRcane	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Saccharum_barberi_Chunnee_Resequencing_P2.fastq.gz	10000	150	brown
Saccharum_barberi_GANAPATHY:JGI:Miami	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Saccharum_barberi_GANAPATHY_Resequencing_P1.fastq.gz	10000	100	brown
Saccharum_barberi_GANAPATHY:JGI:Miami	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Saccharum_barberi_GANAPATHY_Resequencing_P2.fastq.gz	10000	150	brown
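
A file in this layout can be checked before the workflow is launched. The following Python sketch is only an illustration under stated assumptions: the helper `validate_accession_rows` is hypothetical (not a Kmerworkflow command), and it checks only the five-field, tab-separated layout described above, with integer read counts and seeds and no FASTQ path listed twice.

```python
import io


def validate_accession_rows(handle):
    """Return a list of error strings for a LIST_ACCESSION-style table.

    Expected tab-separated fields: individual, FASTQ path, number of reads,
    seed, color. Hypothetical helper, not part of Kmerworkflow.
    """
    errors = []
    seen_paths = set()
    for number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            errors.append(f"line {number}: expected 5 fields, found {len(fields)}")
            continue
        name, path, reads, seed, color = fields
        if not reads.isdigit():
            errors.append(f"line {number}: read count is not an integer: {reads!r}")
        if not seed.isdigit():
            errors.append(f"line {number}: seed is not an integer: {seed!r}")
        if path in seen_paths:
            errors.append(f"line {number}: FASTQ file listed twice: {path}")
        seen_paths.add(path)
    return errors


# Made-up sample names and paths, used only to exercise the check.
example = (
    "sample_A\t/data/sample_A_R1.fastq.gz\t10000\t100\tblack\n"
    "sample_A\t/data/sample_A_R2.fastq.gz\t10000\t150\tblack\n"
)
print(validate_accession_rows(io.StringIO(example)))  # []
```

To check a real file, pass an open file handle instead of the `io.StringIO` object.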

Warning

FASTQ files should preferably be named like NAME_R1.fastq.gz, NAME_R1.fq.gz, NAME_R1.fastq or NAME_R1.fq, and likewise for _R2. Prefer short names and avoid special characters, because they can make the report fail; avoid the long names assigned directly by the sequencer. All FASTQ files must share the same extension, and can be compressed or not. Before launching the pipeline, it is also advisable to check that your data is free of contamination.
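
The naming advice above can be turned into a quick pre-flight check. This Python sketch is an illustration, not a Kmerworkflow feature: the function names are hypothetical, and the set of "safe" characters (letters, digits, underscore, dot, hyphen) is an assumption about what counts as a special character.

```python
import re
from pathlib import PurePosixPath

# Extensions named in the warning above.
EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")
# Assumption: anything outside these characters is "special".
SAFE_NAME = re.compile(r"^[A-Za-z0-9_.-]+$")


def fastq_extension(path):
    """Return the recognised FASTQ extension of `path`, or None."""
    name = PurePosixPath(path).name
    for ext in EXTENSIONS:
        if name.endswith(ext):
            return ext
    return None


def check_fastq_names(paths):
    """Return a list of naming problems for the given FASTQ paths."""
    problems = []
    extensions = set()
    for path in paths:
        name = PurePosixPath(path).name
        ext = fastq_extension(path)
        if ext is None:
            problems.append(f"unrecognised extension: {name}")
        else:
            extensions.add(ext)
        if not SAFE_NAME.match(name):
            problems.append(f"special characters in name: {name}")
    # All files are expected to share a single extension.
    if len(extensions) > 1:
        problems.append("mixed extensions: " + ", ".join(sorted(extensions)))
    return problems


print(check_fastq_names(["a_R1.fastq.gz", "b_R1.fq"]))
```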

2. Parameters for some specific tools 

TOOLS_PARAMS:
    ##########################################################################
    #Choose what you want for tools
    KAT_HIST: "-t 4 --dump_hash --mer_len 50"  #KAT PARAMETERS
    JELLYFISH_DUMP: "-c -t" # Jellyfish parameters; do not change the "-c"
    CUT_COVERAGE: "10" # Default is 10: the pipeline works on k-mers seen at least 10 times
    INTERSECT_TABLE: "--start 1 --end 400" # See the script count_intersection.py for all possible parameters
    FULL_TABLE: "no" # yes or no: if yes, give a name in OPTIONAL FULL_INTERSECT_NAME; otherwise nothing more is needed

The table below describes each parameter of Kmerworkflow:

Parameter	Description
KAT_HIST	Parameters passed to the KAT tools.
JELLYFISH_DUMP	Parameters passed to Jellyfish. Do not change the -c option.
CUT_COVERAGE	Coverage cutoff. If you write 10, the pipeline works on k-mers seen at least 10 times.
INTERSECT_TABLE	Parameters of the script count_intersection.py. See ‘Kmerworkflow/snakemake_scripts/count_intersection.py’ for the parameters it accepts.
FULL_TABLE	A parameter of the script count_intersection.py; write yes or no.
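
To make the CUT_COVERAGE rule concrete, here is a minimal Python sketch of a "seen at least N times" filter applied to a two-column, tab-separated k-mer/count table. It is an illustration only: the function name is hypothetical, the k-mers are made up, and how Kmerworkflow applies the cutoff internally is not shown here.

```python
def filter_by_coverage(lines, cutoff=10):
    """Yield (kmer, count) pairs whose count is at least `cutoff`.

    `lines` holds "kmer<TAB>count" strings. Hypothetical helper.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        kmer, count = line.split("\t")
        if int(count) >= cutoff:
            yield kmer, int(count)


# Made-up k-mers and counts for the example.
dump = ["ACGTACGT\t12", "TTTTAAAA\t3", "GGGGCCCC\t10"]
kept = list(filter_by_coverage(dump, cutoff=10))
print(kept)  # [('ACGTACGT', 12), ('GGGGCCCC', 10)]
```

With a cutoff of 10, a k-mer counted exactly 10 times is kept, which matches the "at least 10 times" wording.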

3. Optional rules and output names

Activate or deactivate the optional steps as you wish, and name the output tables of the pipeline.

Example:

OPTIONAL:
    ############################################################################
    #Choose the name of final tables
    FULL_INTERSECT_NAME: "fastq_uncontaminated_all"   #Give the name of the full intersection table
    PARSED_INTERSECT_NAME: "graph_1_400_fastq_uncontaminated" #Give the name of the table which will be used to make the graph
    UPSET_PLOT: "graph_kmer_uncontaminated"  #Give the name of the graph (upset-plot)
    CHECK_NB_READS: False
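
The FULL_TABLE setting and FULL_INTERSECT_NAME depend on each other: when FULL_TABLE is yes, a name for the full table is expected. The Python sketch below illustrates that consistency check on already-loaded dicts; the function `check_full_table` is hypothetical and not part of Kmerworkflow.

```python
def check_full_table(tools_params, optional):
    """Return a list of problems linking FULL_TABLE to FULL_INTERSECT_NAME.

    Both arguments are assumed to be the TOOLS_PARAMS and OPTIONAL sections
    of config.yaml as plain dicts. Hypothetical helper.
    """
    problems = []
    raw = tools_params.get("FULL_TABLE", "")
    full_table = str(raw).strip().lower()
    if full_table not in ("yes", "no"):
        problems.append(f"FULL_TABLE must be yes or no, got {raw!r}")
    # A full table was requested, so it needs a name.
    if full_table == "yes" and not optional.get("FULL_INTERSECT_NAME"):
        problems.append("FULL_TABLE is yes but FULL_INTERSECT_NAME is empty")
    return problems


print(check_full_table({"FULL_TABLE": "yes"}, {"FULL_INTERSECT_NAME": ""}))
```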

Warning

Please check the documentation of each tool (outside of Kmerworkflow) and make sure that the settings are correct!

How to run the workflow 

Before attempting to run Kmerworkflow, please verify that you have already modified the config.yaml file as explained in 1. Providing data.

If you installed Kmerworkflow on an HPC cluster with a job scheduler, you can run:

run_cluster 

Run snakemake command line with mandatory parameters.
SNAKEMAKE_OTHER: You can also pass additional Snakemake parameters using snakemake syntax.
These parameters will take precedence over Snakemake ones, which were defined in the profile.
See: https://snakemake.readthedocs.io/en/stable/executing/cli.html
Example:
Kmerworkflow run_cluster -c config.yaml --dry-run --jobs 200

Kmerworkflow run_cluster [OPTIONS] [SNAKEMAKE_OTHER]...

Options

-c, --config <config>: Required Configuration file for run tool

-pdf, --pdf

Run snakemake with --dag, --rulegraph and --filegraph

Default: False

Arguments

SNAKEMAKE_OTHER: Optional argument(s)

Warning

Make sure to run the workflow with this command: Kmerworkflow run_cluster -c config.yaml --rerun-triggers mtime

Warning

If you chose to install with Singularity, run this command: Kmerworkflow run_cluster -c config.yaml --rerun-triggers mtime --singularity-args '--home ~/'
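
When the workflow is launched from a script, the two recommended command lines can be assembled in one place. The Python sketch below only builds the argument list and prints it, without running anything. The function `build_run_cluster_command` is a hypothetical helper, and it uses only the options shown in the warnings above.

```python
import shlex


def build_run_cluster_command(config, singularity=False, extra=()):
    """Return the run_cluster command as a list of arguments.

    Hypothetical helper: `singularity=True` adds the --singularity-args
    option recommended for Singularity installs; `extra` holds any other
    Snakemake arguments, appended unchanged.
    """
    command = [
        "Kmerworkflow", "run_cluster",
        "-c", config,
        "--rerun-triggers", "mtime",
    ]
    if singularity:
        # Passed as ONE argument, hence the quotes on the command line.
        command += ["--singularity-args", "--home ~/"]
    command += list(extra)
    return command


cmd = build_run_cluster_command("config.yaml", singularity=True, extra=["--dry-run"])
print(shlex.join(cmd))  # shell-quoted form of the command (Python 3.8+)
```

Keeping the command as a list avoids quoting mistakes if it is later passed to `subprocess.run`.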

run_local 

Run snakemake command line with mandatory parameters.
SNAKEMAKE_OTHER: You can also pass additional Snakemake parameters using snakemake syntax.
These parameters will take precedence over Snakemake ones, which were defined in the profile.
See: https://snakemake.readthedocs.io/en/stable/executing/cli.html
Example:
Kmerworkflow run_local -c config.yaml --threads 8 --dry-run
Kmerworkflow run_local -c config.yaml --threads 8 --singularity-args '--bind /mnt:/mnt'

Kmerworkflow run_local [OPTIONS] [SNAKEMAKE_OTHER]...

Options

-c, --config <config>: Required Configuration file for run tool

-t, --threads <threads>: Required Number of threads

-p, --pdf: Run snakemake with –dag, –rulegraph and –filegraph

Arguments

SNAKEMAKE_OTHER: Optional argument(s)

Advanced run

Providing more resources 

If the cluster default resources are not sufficient, you can edit the cluster_config.yaml file. See 2. Adapting cluster_config.yaml:

edit_cluster_config

Edit the cluster_config.yaml used by the profile

Kmerworkflow edit_cluster_config [OPTIONS]

Providing your own tools_config.yaml 

To change the tools used in a Kmerworkflow workflow, see 3. How to configure tools_path.yaml:

edit_tools

Edit own tools version

Kmerworkflow edit_tools [OPTIONS]

Options

-r, --restore

Restore default tools_config.yaml (from install)

Default: False

Output of Kmerworkflow

The Kmerworkflow output is organized as follows:

OUTPUT_Kmerworkflow/
├── 1_BIS_SUB_SET_READS
├── 1_MERGED_FASTQ
├── 2_KMER_COUNT
├── 3_MERGE_KMER
├── 4_SPLIT_KMER
├── 5_MERGE_TABLE
├── 6_INTERSECTION_TABLE
├── 7_UPSET_PLOT
└── LOGS
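
After a run, the layout above can be compared with what is actually on disk. The Python sketch below lists the expected subdirectories that are missing from an output directory. It is an illustration: the function name is hypothetical, and it assumes every directory in the tree is produced by every run, which may not hold when optional steps are disabled.

```python
from pathlib import Path

# Directory names taken from the tree above.
EXPECTED_DIRS = [
    "1_BIS_SUB_SET_READS",
    "1_MERGED_FASTQ",
    "2_KMER_COUNT",
    "3_MERGE_KMER",
    "4_SPLIT_KMER",
    "5_MERGE_TABLE",
    "6_INTERSECTION_TABLE",
    "7_UPSET_PLOT",
    "LOGS",
]


def missing_output_dirs(output_dir):
    """Return the expected subdirectories absent from `output_dir`."""
    root = Path(output_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# "OUTPUT_Kmerworkflow" stands for the OUTPUT_DIR set in config.yaml.
for name in missing_output_dirs("OUTPUT_Kmerworkflow"):
    print(f"missing: {name}")
```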