Today, assembling a large polyploid genome is still very complicated, which makes it hard to compare the sequences of many individuals of a polyploid species.

With the daily deluge of sequencing data, the volume of data that needs to be analysed keeps increasing …

So a huge question remains:

“How can I compare many individuals of a species without an assembly?”

To that daunting question, we can answer: KMER_WORKFLOW can help you!

KMER_WORKFLOW is an open-source, scalable, modular and traceable Snakemake pipeline that compares multiple short-read (NGS) data sets obtained from Illumina sequencing by counting the number of k-mers they share. KMER_WORKFLOW helps you find which individuals share sequence information with one another.

KMER_WORKFLOW generates an upset plot (graph) showing how many k-mers are shared by how many individuals, together with the sequences of those k-mers.

Subsampling Illumina reads and counting KMERS

The first step of KMER_WORKFLOW is to randomly subsample the reads of every individual. The pipeline takes the number of reads to subsample from each file of a pair and merges the results.

For example, if you want to subsample 10 million reads from PAIRED data, you need to specify 5 million reads to subsample from each file of the pair. However, if you have single-end data, you can specify 10 million reads for the unpaired file.
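The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `reads_per_file` and its parameters are hypothetical and are not part of KMER_WORKFLOW.

```python
def reads_per_file(total_reads: int, paired: bool) -> int:
    """Return the number of reads to request from each input file.

    Hypothetical helper, not part of the pipeline: for paired data the
    total is split evenly between the two files of the pair; for
    single-end data the whole total comes from the single file.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be a positive integer")
    if not paired:
        return total_reads
    if total_reads % 2 != 0:
        raise ValueError("total_reads must be even for paired data")
    return total_reads // 2


if __name__ == "__main__":
    print(reads_per_file(10_000_000, paired=True))   # value for each file of the pair
    print(reads_per_file(10_000_000, paired=False))  # value for the single file
```

With a target of 10 million reads, the function gives 5 million per file for paired data and 10 million for single-end data, which matches the example above.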

Warning

  • CONTAMINATION: BE CAREFUL. MAKE SURE YOUR DATA DOES NOT CONTAIN CONTAMINATION. BEFORE LAUNCHING THE PIPELINE, CONSIDER USING A TOOL LIKE KRAKEN TO CHECK FOR POSSIBLE CONTAMINATION.

  • NUMBER OF READS: MAKE SURE YOUR DATA CONTAINS ENOUGH READS TO SUBSAMPLE. FOR EXAMPLE, IF YOU HAVE _R1.fq.gz AND _R2.fq.gz AND YOU WANT TO SUBSAMPLE 10 MILLION READS, MAKE SURE TO HAVE AT LEAST 5 MILLION READS IN EACH FILE.
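One way to check the read count before launching the pipeline is sketched below. It assumes gzip-compressed FASTQ files in which every record spans exactly four lines; the function names are hypothetical and the script is not shipped with KMER_WORKFLOW.

```python
import gzip


def count_fastq_reads(path: str) -> int:
    """Count reads in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def has_enough_reads(path: str, reads_needed: int) -> bool:
    """Return True if the file holds at least `reads_needed` reads."""
    return count_fastq_reads(path) >= reads_needed
```

For paired data, `has_enough_reads` would be called once on the _R1.fq.gz file and once on the _R2.fq.gz file, each time with half of the total number of reads to subsample.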

Note

  • SEQTK: Seqtk is the tool used to subsample the data.

Included tools :

  • Seqtk version >= 1.3-r106

Next, the pipeline counts the k-mers of each individual.

Note

  • KAT HIST: the K-mer Analysis Toolkit, used to count k-mers and produce a binary output.

  • JELLYFISH DUMP: provides the k-mer counts in a human-readable format.

Included tools :

  • kat version >= 2.4.2

  • jellyfish version >= 2.3.0
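To make the idea of k-mer counting concrete, here is a minimal pure-Python sketch. It is not how kat or jellyfish work internally, and it makes simplifying assumptions: reads are plain strings, k-mers containing N are skipped, and reverse complements are not merged. The function name `count_kmers` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_kmers(sequences: Iterable[str], k: int) -> Counter:
    """Count every k-mer found by sliding a window of size k over each read."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    counts: Counter = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A read shorter than k yields an empty range, hence no k-mers.
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if "N" not in kmer:  # skip windows with ambiguous bases
                counts[kmer] += 1
    return counts
```

For instance, the read `AAAA` with `k=2` contains the k-mer `AA` three times, once for each window position.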

Next, the pipeline runs several steps that merge the k-mer counts of all individuals into a final merged table of k-mer counts per individual.

These steps use only bash commands; there are no included tools.
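The pipeline performs this merge with bash commands; the Python sketch below only illustrates the shape of the result. It assumes each individual's counts are available as text lines of the form `KMER COUNT` (two whitespace-separated columns), and the function name `merge_kmer_counts` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def merge_kmer_counts(
    counts_by_individual: Dict[str, Iterable[str]],
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Merge per-individual 'KMER COUNT' lines into one table.

    Returns the ordered list of individuals (the table columns) and a
    mapping k-mer -> list of counts, with 0 where a k-mer is absent.
    """
    individuals = sorted(counts_by_individual)
    column = {name: i for i, name in enumerate(individuals)}
    table: Dict[str, List[int]] = defaultdict(lambda: [0] * len(individuals))
    for name, lines in counts_by_individual.items():
        for line in lines:
            fields = line.split()
            if len(fields) != 2:
                continue  # ignore blank or malformed lines
            kmer, count = fields
            table[kmer][column[name]] = int(count)
    return individuals, dict(table)
```

Because this sketch keeps the whole table in memory, it is only practical for small inputs.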

Warning

  • MEMORY: MAKE SURE TO ADAPT THE NUMBER OF THREADS AND THE MEMORY PER CPU IN THE cluster_config.yaml FILE IF ONE OF THESE STEPS FAILS.

Calculate intersection of KMERS between individuals

Once the merged k-mer table has been created, the pipeline uses it to calculate the intersections of k-mers across all individuals and produces another table, which is used to draw the k-mer graph (upset plot).

You can change the parameters of this script in the config.yaml file.
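The intersection step can be pictured as follows. This is a minimal sketch under stated assumptions, not the pipeline's own script: the merged table is taken to be a mapping from each k-mer to one count per individual, and both the function name `count_intersections` and the `min_count` threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, List


def count_intersections(
    individuals: List[str],
    table: Dict[str, List[int]],
    min_count: int = 1,
) -> Counter:
    """Count k-mers for each exact set of individuals that carry them.

    A k-mer is considered present in an individual when its count is at
    least `min_count` (an illustrative threshold, not a pipeline option).
    """
    intersections: Counter = Counter()
    for counts in table.values():
        present: FrozenSet[str] = frozenset(
            name for name, count in zip(individuals, counts) if count >= min_count
        )
        if present:  # ignore k-mers absent from every individual
            intersections[present] += 1
    return intersections
```

In the returned counter, keys of size one correspond to k-mers found in a single individual only, and larger keys correspond to k-mers shared by exactly that group of individuals, which is the kind of information an upset plot displays.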

Note

  • COUNT_INTERSECTION.PY: the custom Python script that produces the table used to draw the upset plot.

Included tools :

  • python version >= 3.8.2

UPSET PLOT, KMER intersection graph

The last step of the pipeline is to draw the upset plot.

Note

  • GRAPH_KMER_V3.PL: the custom Perl script that produces the final result of the pipeline: the UPSET PLOT OF SHARED KMERS BETWEEN ALL INDIVIDUALS.

Included tools :

  • perl version >= 5.16.3

Warning

Make sure to install the following Perl libraries:

  • GD::Simple

  • GD::SVG

  • Data::Dumper

  • Getopt::Long

Example of the final graph produced by the pipeline:

The number next to the name of each individual is the number of k-mers found only in that individual (singletons). Next to this number, a bar plot represents the quantity of singletons.

At the top of the graph, the number above each column is the number of k-mers shared for that column, and XXX.fasta is the name of the file that contains the sequences of those shared k-mers.

KMER_GRAPH

The directed acyclic graph (DAG) shows all the steps of the pipeline:

[Figure: DAG of the pipeline]