How to create a workflow

Kmerworkflow lets you build a workflow from a simple config.yaml configuration file:

  • First, provide the data paths.

  • Second, set the tool parameters.

  • Last, name the output tables and choose the optional rules.

To create this file, just run:

create_config

Create the config.yaml file for a run

Kmerworkflow create_config [OPTIONS]

Options

-c, --configyaml <configyaml>

Required Path to create config.yaml

Then, edit the relevant sections of the file to customize your flavor of a workflow.

1. Providing data

First, indicate the data path in the config.yaml configuration file:

DATA:
    ############################################################################
    #Give the paths of the accession files, of the scripts, and of the directory where results will be written
    OUTPUT_DIR: "/home/durandt/scratch/WORKFLOW_KMER/WORKFLOW_KMER_V2/RESULTS_FASTQ_UNCONTAMINATED/"  #Directory where all the results will be written
    LIST_ACCESSION: "/home/durandt/scratch/WORKFLOW_KMER/WORKFLOW_KMER_V2/list_fastq_uncontaminate.txt"  #File needed to launch the workflow; one tab-separated line per FASTQ file: accession name, path to the FASTQ file, read count, seed, color
    LIST_COULEURS: "/home/durandt/scratch/WORKFLOW_KMER/WORKFLOW_KMER_V2/list_couleur"  #Path of the file used to color the individuals in the graph. This file is created automatically
    

The table below describes each entry needed to launch Kmerworkflow:

Input            Description
LIST_ACCESSION   A tab-separated text file written by the user. It holds information about the individuals and their FASTQ files. See below.
LIST_COULEURS    Path to the file that provides the colors for the final graph of the pipeline. THIS FILE IS CREATED AUTOMATICALLY; DO NOT CREATE IT.
OUTPUT_DIR       Path to the output directory.

Example of a “LIST_ACCESSION” file:

Warning

MAKE SURE TO HAVE ONE LINE PER FASTQ FILE. In this file, separate the fields with tabs:

  • The first field is the name of the individual.

  • The second field is the path to your FASTQ file.

  • The third field is the number of reads to subsample. IF YOU HAVE PAIRED DATA AND WANT TO SUBSAMPLE 20 MILLION READS, WRITE 10 MILLION FOR EACH FILE OF THE PAIR.

  • The fourth field is the seed for the random subsampling. USE THE SAME SEED FOR ALL _R1.fq FILES AND ANOTHER SEED FOR ALL _R2.fq FILES, FOR EXAMPLE 100 FOR R1 AND 150 FOR R2.

  • The last field is the color assigned to the individual (it appears on the final graph).

Erianthus_arundinaceus_EA001:JGI:Guangxi	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_arundinaceus_EA001_Resequencing_P1.fastq.gz	10000	100	black
Erianthus_arundinaceus_EA001:JGI:Guangxi	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_arundinaceus_EA001_Resequencing_P2.fastq.gz	10000	150	black
Erianthus_arundinaceus_IK76-48:JGI:SRA	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_arundinaceus_IK_76-48_Resequencing_P1.fastq.gz	10000	100	black
Erianthus_arundinaceus_IK76-48:JGI:SRA	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_arundinaceus_IK_76-48_Resequencing_P2.fastq.gz	10000	150	black
Erianthus_fulvus_EF001:JGI:Guangxi	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_fulvus_EF001_Resequencing_P1.fastq.gz	10000	100	black
Erianthus_fulvus_EF001:JGI:Guangxi	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Erianthus_fulvus_EF001_Resequencing_P2.fastq.gz	10000	150	black
Miscanthus_Floridus_MiscFlo-PI295762:JGI:Houma	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_Floridus_MiscFlo-PI295762_Resequencing_P1.fastq.gz	10000	100	black
Miscanthus_Floridus_MiscFlo-PI295762:JGI:Houma	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_Floridus_MiscFlo-PI295762_Resequencing_P2.fastq.gz	10000	150	black
Miscanthus_sinense_JW484:JGI:JIRCAS	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_sinense_JW484_Resequencing_P1.fastq.gz	10000	100	black
Miscanthus_sinense_JW484:JGI:JIRCAS	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_sinense_JW484_Resequencing_P2.fastq.gz	10000	150	black
Miscanthus_sinense_NG7722:JGI:CIRAD	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_sinense_NG7722_Resequencing_P1.fastq.gz	10000	100	black
Miscanthus_sinense_NG7722:JGI:CIRAD	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Miscanthus_sinense_NG7722_Resequencing_P2.fastq.gz	10000	150	black
Narenga_porphyrocoma_Narenga:JGI:JIRCAS	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Narenga_porphyrocoma_Narenga_Resequencing_P1.fastq.gz	10000	100	black
Narenga_porphyrocoma_Narenga:JGI:JIRCAS	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Narenga_porphyrocoma_Narenga_Resequencing_P2.fastq.gz	10000	150	black
Narenga_sp_N001:JGI:Guangxi	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Narenga_sp_N001_Resequencing_P1.fastq.gz	10000	100	black
Narenga_sp_N001:JGI:Guangxi	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Narenga_sp_N001_Resequencing_P2.fastq.gz	10000	150	black
Saccharum_barberi_Chunnee:JGI:eRcane	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Saccharum_barberi_Chunnee_Resequencing_P1.fastq.gz	10000	100	brown
Saccharum_barberi_Chunnee:JGI:eRcane	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Saccharum_barberi_Chunnee_Resequencing_P2.fastq.gz	10000	150	brown
Saccharum_barberi_GANAPATHY:JGI:Miami	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Saccharum_barberi_GANAPATHY_Resequencing_P1.fastq.gz	10000	100	brown
Saccharum_barberi_GANAPATHY:JGI:Miami	/storage/replicated/cirad/projects/SEG/SUGARCANE/JGI_CSP_DIVERSITY_PROJECT/Clean_seq_data/FastQ_clean_files/Saccharum_barberi_GANAPATHY_Resequencing_P2.fastq.gz	10000	150	brown
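
The rules above can be checked before launching the workflow. The following is a minimal sketch, not part of Kmerworkflow: the function name check_accession_list is hypothetical, and the way the mate is recognised (a _R1/_R2 or _P1/_P2 suffix before the extension) is an assumption based on the file names shown in this page.

```python
import csv
import io
import re
from collections import defaultdict

# Assumption: the mate number is encoded as _R1/_R2 or _P1/_P2 before the extension.
MATE_PATTERN = re.compile(r"_[RP]([12])\.(?:fastq|fq)(?:\.gz)?$")


def check_accession_list(handle):
    """Return a list of problems found in a LIST_ACCESSION-style file."""
    problems = []
    seeds_by_mate = defaultdict(set)  # mate number ("1" or "2") -> seeds seen
    for number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # ignore empty lines
        if len(row) != 5:
            problems.append(
                f"line {number}: expected 5 tab-separated fields, got {len(row)}"
            )
            continue
        name, path, n_reads, seed, color = row
        if not n_reads.isdigit():
            problems.append(f"line {number}: read count {n_reads!r} is not an integer")
        if not seed.isdigit():
            problems.append(f"line {number}: seed {seed!r} is not an integer")
            continue
        match = MATE_PATTERN.search(path)
        if match:
            seeds_by_mate[match.group(1)].add(seed)
    # All files of the same mate are expected to share one seed.
    for mate, seeds in sorted(seeds_by_mate.items()):
        if len(seeds) > 1:
            problems.append(f"mate {mate}: several seeds used: {sorted(seeds)}")
    return problems


example = (
    "Ind_A\t/data/Ind_A_P1.fastq.gz\t10000\t100\tblack\n"
    "Ind_A\t/data/Ind_A_P2.fastq.gz\t10000\t150\tblack\n"
)
print(check_accession_list(io.StringIO(example)))
```

To check a real file, open it with open(path, newline="") and pass the handle to the function.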

Warning

For FASTQ files, prefer a naming convention such as NAME_R1.fastq.gz, NAME_R1.fq.gz, NAME_R1.fastq or NAME_R1.fq (and likewise for _R2). Prefer short names and avoid special characters, because the report can fail. Avoid using the long names produced directly by the sequencer. All FASTQ files must share the same extension, and may be compressed or not. Before launching the pipeline, it is also advisable to check that your data does not contain contamination.
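
The naming advice can be tested on a list of paths with a short script. This is only a sketch under assumptions: the helper naming_problems and the set of characters treated as "safe" (letters, digits, underscore, dot, hyphen) are illustrative choices, not rules enforced by Kmerworkflow. The example file above uses _P1/_P2 suffixes, so the pattern would need adjusting for such names.

```python
import re
from pathlib import PurePosixPath

# Assumption: "no special characters" means letters, digits, "_", "." and "-" only.
FASTQ_NAME = re.compile(r"^[A-Za-z0-9_.-]+_R[12]\.(fastq|fq)(\.gz)?$")


def naming_problems(paths):
    """Return a list of naming problems for the given FASTQ paths."""
    problems = []
    extensions = set()
    for path in paths:
        name = PurePosixPath(path).name
        match = FASTQ_NAME.match(name)
        if match is None:
            problems.append(f"{name}: does not look like NAME_R1/NAME_R2 with .fastq or .fq")
            continue
        # Record the full extension, e.g. "fastq.gz" or "fq".
        extensions.add(match.group(1) + (match.group(2) or ""))
    if len(extensions) > 1:
        problems.append(f"mixed extensions: {sorted(extensions)}")
    return problems


print(naming_problems(["/data/S1_R1.fastq.gz", "/data/S1_R2.fastq.gz"]))
```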

2. Parameters for some specific tools

TOOLS_PARAMS:
    ##########################################################################
    #Choose what you want for tools
    KAT_HIST: "-t 4 --dump_hash --mer_len 50"  #KAT PARAMETERS
    JELLYFISH_DUMP: "-c -t" #JELLYFISH PARAMETERS Don't change the "-c" 
    CUT_COVERAGE: "10" #10 is default value it will check if the kmer is seen at minimum 10 times
    INTERSECT_TABLE: "--start 1 --end 400" #check script count_intersection.py to see all params possible 
    FULL_TABLE: "no" #yes OR no : If yes give name in OPTIONAL FULL_INTERSECT_NAME else do nothing

The table below describes each parameter of Kmerworkflow:

Params            Description
KAT_HIST          Parameters passed to the KAT tool.
JELLYFISH_DUMP    Parameters passed to Jellyfish. Do not change the "-c".
CUT_COVERAGE      The coverage cutoff. With 10, the pipeline works on k-mers seen at least 10 times.
INTERSECT_TABLE   Parameters passed to the script count_intersection.py. See ‘Kmerworkflow/snakemake_scripts/count_intersection.py’ for the available parameters.
FULL_TABLE        A parameter of the script count_intersection.py: write yes or no. If yes, give a name in FULL_INTERSECT_NAME (OPTIONAL section).
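
To make the meaning of CUT_COVERAGE concrete, here is a small illustration of a minimum-count filter. It is not the code used by the pipeline; the function filter_by_coverage and the toy k-mer counts are made up for the example.

```python
def filter_by_coverage(kmer_counts, cut_coverage="10"):
    """Keep the k-mers seen at least cut_coverage times.

    cut_coverage is accepted as a string because the value is quoted in config.yaml.
    """
    threshold = int(cut_coverage)
    return {kmer: count for kmer, count in kmer_counts.items() if count >= threshold}


# Toy counts: only the k-mers seen 10 times or more are kept.
counts = {"ACGT": 3, "CGTA": 10, "GTAC": 42}
print(filter_by_coverage(counts))  # {'CGTA': 10, 'GTAC': 42}
```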

3. Optional rules and names of the outputs

Activate or deactivate the optional rules as you wish, and name the output tables of the pipeline.

Example:

OPTIONAL:
    ############################################################################
    #Choose the name of final tables
    FULL_INTERSECT_NAME: "fastq_uncontaminated_all"   #Give the name of the full intersection table
    PARSED_INTERSECT_NAME: "graph_1_400_fastq_uncontaminated" #Give the name of the table which will be used to make the graph
    UPSET_PLOT: "graph_kmer_uncontaminated"  #Give the name of the graph (upset-plot)
    CHECK_NB_READS: False

Warning

Please check the documentation of each tool (outside of Kmerworkflow) and make sure that the settings are correct!
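
The three sections of config.yaml depend on each other: FULL_TABLE set to "yes" requires FULL_INTERSECT_NAME. The sketch below checks this on a configuration that has already been loaded into a Python dictionary (for instance with a YAML parser). The function check_config is hypothetical and only covers the keys described in this page. Note that many YAML parsers read an unquoted yes or no as a boolean, so keeping the quotes around the value is the safer choice.

```python
def check_config(config):
    """Return a list of problems found in a loaded config.yaml dictionary."""
    problems = []
    for section in ("DATA", "TOOLS_PARAMS", "OPTIONAL"):
        if section not in config:
            problems.append(f"missing section {section}")
    tools = config.get("TOOLS_PARAMS", {})
    optional = config.get("OPTIONAL", {})
    full_table = str(tools.get("FULL_TABLE", "")).lower()
    if full_table not in ("yes", "no"):
        problems.append('FULL_TABLE must be "yes" or "no"')
    elif full_table == "yes" and not optional.get("FULL_INTERSECT_NAME"):
        problems.append('FULL_TABLE is "yes" but FULL_INTERSECT_NAME is empty')
    return problems


config = {
    "DATA": {"OUTPUT_DIR": "results/"},
    "TOOLS_PARAMS": {"FULL_TABLE": "yes"},
    "OPTIONAL": {"FULL_INTERSECT_NAME": ""},
}
print(check_config(config))
```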


How to run the workflow

Before attempting to run Kmerworkflow, please verify that you have already modified the config.yaml file as explained in 1. Providing data.

If you installed Kmerworkflow on an HPC cluster with a job scheduler, you can run:

run_cluster

Run snakemake command line with mandatory parameters.
SNAKEMAKE_OTHER: You can also pass additional Snakemake parameters using snakemake syntax.
These parameters will take precedence over Snakemake ones, which were defined in the profile.
Example:
Kmerworkflow run_cluster -c config.yaml --dry-run --jobs 200
Kmerworkflow run_cluster [OPTIONS] [SNAKEMAKE_OTHER]...

Options

-c, --config <config>

Required Configuration file for run tool

-pdf, --pdf

Run snakemake with --dag, --rulegraph and --filegraph

Default:

False

Arguments

SNAKEMAKE_OTHER

Optional argument(s)

Warning

MAKE SURE TO RUN THE WORKFLOW WITH THIS COMMAND: Kmerworkflow run_cluster -c config.yaml --rerun-triggers mtime

Warning

IF YOU CHOSE TO INSTALL WITH SINGULARITY, RUN THIS COMMAND: Kmerworkflow run_cluster -c config.yaml --rerun-triggers mtime --singularity-args '--home ~/'
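
If the command is launched from a script, the two variants above can be assembled from a single helper. This is a sketch only: build_run_cluster_command is a hypothetical name, and the flags are exactly those quoted in the warnings above.

```python
import shlex


def build_run_cluster_command(config_path, singularity=False):
    """Build the run_cluster command line as a list of arguments."""
    command = [
        "Kmerworkflow", "run_cluster",
        "-c", config_path,
        "--rerun-triggers", "mtime",
    ]
    if singularity:
        # Extra arguments recommended for a Singularity install.
        command += ["--singularity-args", "--home ~/"]
    return command


# shlex.join quotes the arguments so the line can be pasted into a shell.
print(shlex.join(build_run_cluster_command("config.yaml", singularity=True)))
```

The returned list can also be passed directly to subprocess.run.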


run_local

Run snakemake command line with mandatory parameters.
SNAKEMAKE_OTHER: You can also pass additional Snakemake parameters using snakemake syntax.
These parameters will take precedence over Snakemake ones, which were defined in the profile.
Example:
Kmerworkflow run_local -c config.yaml --threads 8 --dry-run
Kmerworkflow run_local -c config.yaml --threads 8 --singularity-args '--bind /mnt:/mnt'
Kmerworkflow run_local [OPTIONS] [SNAKEMAKE_OTHER]...

Options

-c, --config <config>

Required Configuration file for run tool

-t, --threads <threads>

Required Number of threads

-p, --pdf

Run snakemake with --dag, --rulegraph and --filegraph

Arguments

SNAKEMAKE_OTHER

Optional argument(s)


Advanced run

Providing more resources

If the cluster default resources are not sufficient, you can edit the cluster_config.yaml file. See 2. Adapting cluster_config.yaml:

edit_cluster_config

Edit the cluster_config.yaml used by the profile

Kmerworkflow edit_cluster_config [OPTIONS]

Providing your own tools_config.yaml

To change the tools used in a Kmerworkflow workflow, see 3. How to configure tools_path.yaml

edit_tools

Edit your own tools versions

Kmerworkflow edit_tools [OPTIONS]

Options

-r, --restore

Restore default tools_config.yaml (from install)

Default:

False


Output of Kmerworkflow

The Kmerworkflow output is organized as follows:

OUTPUT_Kmerworkflow/
├── 1_BIS_SUB_SET_READS
├── 1_MERGED_FASTQ
├── 2_KMER_COUNT
├── 3_MERGE_KMER
├── 4_SPLIT_KMER
├── 5_MERGE_TABLE
├── 6_INTERSECTION_TABLE
├── 7_UPSET_PLOT
└── LOGS
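
After a run, the layout above can be compared with what is actually on disk. The sketch below assumes that the root of the tree is the directory given as OUTPUT_DIR and that every listed directory is always produced; directories that depend on optional rules may legitimately be absent. The function missing_output_dirs is a hypothetical helper.

```python
from pathlib import Path

# Directory names copied from the tree above.
EXPECTED_DIRS = [
    "1_BIS_SUB_SET_READS",
    "1_MERGED_FASTQ",
    "2_KMER_COUNT",
    "3_MERGE_KMER",
    "4_SPLIT_KMER",
    "5_MERGE_TABLE",
    "6_INTERSECTION_TABLE",
    "7_UPSET_PLOT",
    "LOGS",
]


def missing_output_dirs(output_dir):
    """Return the expected sub-directories that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# Example: report what is missing in the current directory.
print(missing_output_dirs("."))
```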