compare multiple fasta files using nucmer and extract aligned fragments

alipirani88/find_hgt_pipeline

Synopsis:

The pipeline compares multiple fasta files using nucmer and extracts aligned fragments that meet user-defined percent identity and minimum aligned length parameters.

Requirements:

  • An assembly pseudomolecule in fasta format for each sample.
  • Annotation files in gff and bed format. Both are required.

Preparing input

  • Making a pseudomolecule for input fasta files

Stitch all contigs of a de novo assembly into a pseudomolecule using a linker sequence, or order the contigs against a reference genome using abacas.

Example:

Stitch contigs using the linker sequence "NNNNNCATTCCATTCATTAATTAATTAATGAATGAATGNNNNN", or use the pseudomolecule generated by abacas contig ordering.


Pending: a bash script for contig stitching.
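Until that script is available, the stitching step described above could be sketched in Python as follows. This is a minimal illustration, not part of the pipeline: the function names `read_fasta` and `stitch_contigs`, the 60-column line wrapping, and the choice to keep contigs in file order are all assumptions.

```python
# Linker sequence taken from the example above.
LINKER = "NNNNNCATTCCATTCATTAATTAATTAATGAATGAATGNNNNN"


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def stitch_contigs(lines, name, linker=LINKER, width=60):
    """Join all contigs (in input order) with the linker into one FASTA record.

    `name` becomes the header of the pseudomolecule; `width` is the
    assumed line-wrapping width of the output sequence.
    """
    sequence = linker.join(seq for _, seq in read_fasta(lines))
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(wrapped) + "\n"
```

To use it on a file, pass an open file handle as `lines`, for example `stitch_contigs(open("contigs.fasta"), "sample1")`, where both the path and the sample name are placeholders.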

  • Annotations:

Annotate the pseudomolecules using prokka or any other tool. The script expects an individual annotation folder for each sample, containing its gff and bed files.
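Before running the pipeline, it can help to confirm that every sample has its annotation folder in place. The sketch below is a hypothetical pre-flight check, not part of the pipeline; it assumes the sample prefix is the fasta filename without its last extension and that annotation files end in `.gff` and `.bed`.

```python
import os


def missing_annotations(fasta_names, prokka_dir):
    """Return {prefix: [missing extensions]} for samples lacking gff or bed files.

    Assumes each sample folder inside `prokka_dir` is named after the
    fasta file prefix (filename minus its last extension).
    """
    problems = {}
    for fasta in fasta_names:
        prefix = os.path.splitext(os.path.basename(fasta))[0]
        sample_dir = os.path.join(prokka_dir, prefix)
        # A missing folder is treated as a folder with no files.
        found = os.listdir(sample_dir) if os.path.isdir(sample_dir) else []
        missing = [ext for ext in (".gff", ".bed")
                   if not any(f.endswith(ext) for f in found)]
        if missing:
            problems[prefix] = missing
    return problems
```

An empty dictionary means every listed sample has both file types.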

Usage

usage: recombination_analysis.py [-h] -filename FILENAME -out OUT -prokka_dir
                                 PROKKA_DIR [-jobrun JOBRUN] -dir DIR
                                 -analysis ANALYSIS_NAME
                                 [-remove_temp REMOVE_TEMP] -steps STEPS
                                 [-pbs PBS]

Recombination/HGT Analysis.
The pipeline takes a list of fasta files and aligns them all-vs-all using Nucmer.
Extracts aligned regions by parsing nucmer coordinates and gff/bed annotations, keeping regions that match the user-defined percent identity and minimum aligned length parameters.
Generates a preliminary reference database from the extracted aligned regions by deduplicating and removing containments.
Removes contained fragments from the preliminary database using nucmer.
Performs nucmer alignment between each pseudomolecule fasta file and the final containment-removed aligned fragments to generate an alignment score matrix.

optional arguments:
  -h, --help            show this help message and exit

Required arguments:
  -filename FILENAME    File containing a list of fasta filenames (one per line) to use from the -dir folder. For genome coordinate consistency, make sure the fasta files are in pseudomolecule format.
  -out OUT              Output directory to save the results
  -prokka_dir PROKKA_DIR
                        Directory containing the results of the Prokka annotation pipeline, or individual sample folders each containing a gff and a bed file. Each folder name should match the fasta file prefix.
  -jobrun JOBRUN        Type of job to run: cluster (on a compute cluster), parallel-local (in parallel on the local system), or local (on the local system, default).
  -dir DIR              Directory containing fasta files specified in -filename list
  -analysis ANALYSIS_NAME
                        Unique Analysis Name to save results with this prefix

Optional arguments:
  -remove_temp REMOVE_TEMP
                        Remove Temporary directories from /tmp/ folder: yes/no
  -steps STEPS          Analysis steps to perform. Use All or 1,2,3,4,5 to run all steps of the pipeline.
                        1: Align all input assembly fasta files against each other using Nucmer.
                        2: Parse the Nucmer-generated coordinate files and extract individual aligned fragments and their respective annotations for metadata.
                        3: Generate a database of the extracted aligned regions by deduplicating and removing containments using BBMap's dedupe tool.
                        4: Remove containments from the preliminary database by running nucmer.
                        5: Perform nucmer alignment between the input fasta files and the final containment-removed extracted fragments to generate an alignment score matrix.
  -pbs PBS              Provide PBS memory resources for individual nucmer jobs. Default: nodes=1:ppn=1,pmem=4000mb,walltime=6:00:00
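The filtering idea behind step 2 can be illustrated with a short Python sketch. It is not the pipeline's own parser: it assumes tab-delimited coordinate rows in the layout S1, E1, S2, E2, LEN 1, LEN 2, % IDY, reference tag, query tag, and it assumes the minimum length is applied to the reference-side length. The names `Alignment`, `parse_coords`, and `filter_alignments` are hypothetical.

```python
from collections import namedtuple

# One aligned region; field order follows the assumed column layout.
Alignment = namedtuple("Alignment", "s1 e1 s2 e2 len1 len2 identity ref query")


def parse_coords(lines):
    """Yield Alignment records, skipping header or malformed rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # too few columns to be a data row
        try:
            numbers = [int(x) for x in fields[:6]] + [float(fields[6])]
        except ValueError:
            continue  # header row or unexpected content
        yield Alignment(*numbers, ref=fields[-2], query=fields[-1])


def filter_alignments(lines, min_identity, min_length):
    """Keep alignments meeting both the identity and the length threshold."""
    return [a for a in parse_coords(lines)
            if a.identity >= min_identity and a.len1 >= min_length]
```

For example, `filter_alignments(open("pair.coords"), 99.0, 1000)` would keep regions of at least 1000 bp at 99% identity or higher; the file name and both thresholds are placeholders.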

Example:

python recombination_analysis.py -filename filenames -out /path-to-out-dir/ -prokka_dir /path-to/fasta_file_annotations/ -jobrun parallel-local -dir /path-to-pseudomolecule/fasta_files/ -analysis 2018_07_18_analysis_name -steps All

or

python recombination_analysis.py -filename filenames -out /path-to-out-dir/ -prokka_dir /path-to/fasta_file_annotations/ -jobrun parallel-local -dir /path-to-pseudomolecule/fasta_files/ -analysis 2018_07_18_analysis_name -steps 1,2,3,4,5
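The `filenames` file passed to -filename in these examples lists one fasta filename per line. One way to generate it is sketched below. The helper `write_filename_list` and the set of accepted extensions are assumptions, and the sketch writes bare filenames (no directory part) on the reading that the pipeline resolves them against -dir.

```python
import os

# Assumed fasta extensions; adjust to match the actual input files.
FASTA_EXTENSIONS = (".fasta", ".fa", ".fna")


def write_filename_list(fasta_dir, list_path, extensions=FASTA_EXTENSIONS):
    """Write the sorted fasta filenames found in `fasta_dir` to `list_path`.

    Returns the list of names that were written, one per line.
    """
    names = sorted(f for f in os.listdir(fasta_dir)
                   if f.lower().endswith(extensions))
    with open(list_path, "w") as handle:
        for name in names:
            handle.write(name + "\n")
    return names
```

Calling `write_filename_list("/path-to-pseudomolecule/fasta_files/", "filenames")` would then produce the list file used in the commands above.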

Output:

  • Final_HGT_score_matrix.csv: This file contains the final score matrix computed from nucmer alignments between the uniquely extracted fragments and the input assembly fasta files.

  • Final_HGT_score_matrix_meta.tsv: This file contains gene annotations for each uniquely extracted fragment.
