Flanker is a tool for studying the homology of gene-flanking sequences. It will annotate FASTA/multi-FASTA files for specified genes, then write the flanking sequences to new FASTA files. There is also an optional step to cluster the flanks by sequence identity.
If you use Flanker in your work please cite our pre-print:
Matlock W, Lipworth S, Constantinides B, Peto TEA, Walker AS, Crook D, Hopkins S, Shaw LP, Stoesser N. Flanker: a tool for comparative genomics of gene flanking regions. BioRxiv. 2021. doi: https://doi.org/10.1101/2021.02.22.432255
Bioconda (version 0.1.5)
conda install -c bioconda flanker
Conda + pip
conda create -n flanker -c bioconda python=3 abricate=1.0.1 mash conda activate flanker pip install git+https://github.com/wtmatlock/flanker
This is may be slightly ahead of the Bioconda release at times.
pip install git+https://github.com/wtmatlock/flanker # Install Mash and Abricate seperately
First we download some hybrid assemblies of plasmid genomes from David, Sophia, et al. "Integrated chromosomal and plasmid sequence analyses reveal diverse modes of carbapenemase gene spread among Klebsiella pneumoniae." Proceedings of the National Academy of Sciences 117.40 (2020): 25043-25054.
There are 44 plasmid genomes of which 16 and 28 contain blaKPC-2 and blaKPC-3, respectively. You'll need to take all replicons of interest (in this case the 44 plasmids) and concatenate these into a multi-FASTA file:
cat *fsa > david_plasmids.fasta
You should then rename the FASTA headers so that they match the original files (if this is not already the case). We have provided a simple script to do this (in /scripts/ directory on github):
ls *fsa | sed 's/[.]fsa//' > input_files python multi_fa_rename.py david_plasmids.fasta input_files david_plasmids_renamed.fasta mv david_plasmids_renamed.fasta david_plasmids.fasta
Now you are ready to use Flanker. In this example we are going to compare the flanking sequences around blaKPC-2. We are going to extract windows from 0bp (
--window) to 5000bp (
--wstop) base pairs in 100bp chuncks (
--wstep) to the upsteam (
--flank upstream) of the gene. We will include the gene (
--include_gene) and use the default NCBI database.
flanker --flank upstream --window 0 --wstop 5000 --wstep 100 --gene blaKPC-2 --fasta_file david_plasmids.fasta --include_gene
You should now see many fasta files in the working directory containing upstream flanking regions from 0 to 4900 bp.
||Input .fasta file|
||Space-delimited list of genes in the command line OR newline-delimited .txt of genes|
||Displays help information then closes Flanker||
||Choose which side(s) of the gene to extract (upstream/downstream/both)||
||One of "default" - normal mode, "mm" - multi-allelic cluster, or "sm" - salami-mode||
||Add if your sequence is circularised||
||Add if you want the gene included in the output .fasta||
||Add if you want to find the closest match to your query by string edit distance||
||Specify the database Abricate will use to find the gene(s)||
||Increase verbosity: 0 := only warnings, 1 := info, 2 := everything.||
||Flank length on either side of gene||
||For iterating: terminal flank length AND step size,
||Use clustering mode?||
||Prefix for clustering output file||-|
||Mash distance threshold for clustering||
||k-mer length for Mash||
||Sketch size for Mash||
||Threads to use for Mash||
Having extracted flanking sequences around a gene, you might then want to cluster them into groups which share high sequence identity. Flanker does this using single-linkage clustering of Mash distances. The method is very similar to that used by Ryan Wick in his Assembly-Dereplicator package (and indeed we re-use several of his functions).
-cl flag turns on clustering mode. The most important parameter is
-tr - after running Mash, Flanker filters the
Mash dist output so that only pairs with a mash dist smaller than this threshold are used to build the graph. You can also optionally tune the kmer-length
-k and sketch size
-s which mash uses, see Mash-Docs.
A seperate clustering file is produced for each window examined. The output is a comma separated file with two columns: sequence and cluster group. These can easily be combined for further processing:
cat out* | sed '/assembly/d' > all_out sed -i '1 i\flank,cluster' all_out
You can take this output and create figures similar to those in our manuscript (see the Binder on the Flanker GitHub page) or use in custom downstream applications.
If you feed flanker a list of genes (
--list_of_genes) in default mode (
--mode default), flanker considers each of these in turn. However, if you turn on multi-allelic mode (
--mode mm), it considers all genes in the list for each window. This allows you to detect flanking regions which are similar between different alleles of genes (e.g. blaKPC-2/3 etc) and between completely different genes.
Salami mode (
--mode sm) considers each window (of length
--wstep) from 0 to
--wstop as a seperate contiguous sequence; in default mode these are concatenated together. This is intended to allow detection of recombination/mobile genetic elements which are occur in diverse genetic contexts. At present Salami mode always starts at the edge of the gene and does not include the gene, therefore setting
-inc currently do nothing in this mode. As with the default mode, you can optionally perform clustering on the output from salami mode by using the
-cl flag. You cannot run salami with
-f both, if you try to do this alami will default to
For instance, here we extract 100bp windows from 0-5000bp downstream of the blaTEM-1B gene.
flanker --fasta_file example.fasta --gene blaTEM-1B --wstop 5000 --wstep 100 --flank downstream --mode sm
- Gene queries use substring matching by default, so e.g. querying
blawill return some beta-lactamase, if it's there. Also be mindful that non-default databases, such as Resfinder, add indexing after annotation names e.g.
blaCTX-M-15_1. Please check your Abricate output if you are unsure of the naming conventions. In closest match mode, the annotation with the smallest Levenshtein distance to your query will be used.
If you would like to contribute to this project, please open an issue or contact the authors directly.
To install and test a development build:
git clone https://github.com/wtmatlock/flanker cd flanker conda create -n flanker -c bioconda python=3 abricate=1.0.1 mash pytest conda activate flanker pip install --editable ./ pytest