Skip to content
/ polap Public

Polap is a plant mtDNA assembly pipeline using Flye long-read assembler guided by organelle gene annotation.

License

Notifications You must be signed in to change notification settings

goshng/polap

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

45 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Description

Polap is a meta-assembler pipeline that treats Flye as a blackbox tool. Therefore, it can be described as a Flye helper pipeline with additional features of organellar gene annotation, which aids in the selection of organelle-origin contigs obtained from the initial Flye whole-genome assembly. The name "polap" was chosen as it was inspired by the chloroplast genome assembly pipeline known as ptGAUL, developed by Wenbin Zhou. Polap utilizes the long read genome assembler, Flye, in combination with the sequence polishing tool, FMLRC, developed by Matt Holt. It also leverages the sequencing depths obtained from a flye genome assembly and organelle gene annotation to identify contigs that belong to the user's target mitochondrial genome.

Polap is a project that originated from the idea of ptGAUL and aims to investigate how to assemble plant mitochondrial DNA (mtDNA) using long reads. The process begins by performing a Flye whole-genome assembly without a reference genome to generate contigs that are then subjected to BLAST to confirm the presence of mtDNA and ptDNA genes. Depth and Copy Number are also determined after Flye assembly. Using this information and the mtDNA and ptDNA genes, the pipeline helps make educated guess about which contigs originate from mtDNA.

Contigs that are suspected to originate from mtDNA are collected, and long reads that map well on to these contigs are used for mtDNA long-read assembly. Once mtDNA assembly is complete, the FMLRC polishing pipeline is employed to finalize the mtDNA.

Requirements

Please, note that polap runs on only a Linux OS with Miniconda. I do not know how I format the recipe for Bioconda's polap package, and it looks like that polap could be executed in other platforms. But, it does not run on either macOS or Windows computer. Although Bioconda's polap package can be installed into either of them, they would fail to execute.

It requires BASH (bash>=3.0).

bash --version

You must install a conda package manager like Miniconda on your favorite Linux Operating System in a command-line interface.

$ curl -OL https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

Your terminal should look like this after installing Miniconda, exiting, and relogin back to your Linux terminal. The prompt with (base) indicates your successful Miniconda installation.

(base) $

Now, configure your conda with the following:

(base) $ conda update -y -n base conda
(base) $ conda config --prepend channels bioconda
(base) $ conda config --prepend channels conda-forge

Make sure that your default channels are ordered like this:

(base) $ conda config --show channels
channels:
  - conda-forge
  - bioconda
  - defaults

Installing Polap

You could use Bioconda's polap package or the yaml file from this github source.

Using Biocoda polap package

You could install Bioconda's polap package in a new conda environment, and then run Polap with a test dataset.

(base) $ conda create --name polap bioconda::polap
(base) $ conda activate polap
(polap) $ git clone https://github.com/goshng/polap.git
(polap) $ cd polap/test
(polap) $ polap assemble --test

If you see a screen output ending with the something like this:

NEXT: polap prepare-polishing [-a s1.fq] [-b s2.fq]

You may be ready to use Polap. But, make sure that you do not have error messages from Polap.

Because FMLRC's conda environment was incompatible with that of Polap's, we need to create one for Polap's FMLRC conda environment. Continued from the previous command lines,

(polap) $ conda deactivate
(base) $ conda env create -f ../src/polap-conda-environment-fmlrc.yaml
(base) $ conda activate polap-fmlrc
(polap-fmlrc) $ ../src/polap.sh prepare-polishing
(polap-fmlrc) $ ../src/polap.sh polish

Your final mitochondrial genome sequence is mt.1.fa.

Using github source

Download the source code of Polap available at Polap's github website:

(base) $ git clone https://github.com/goshng/polap.git

Setup the conda environments for Polap.

(base) $ conda env create -f src/polap-conda-environment.yaml

Setup the conda environment for FMLRC.

(base) $ conda env create -f src/polap-conda-environment-fmlrc.yaml

Activate polap conda environment.

(base) $ conda activate polap

Run Polap with a test dataset.

(polap) $ cd test
(polap) $ ../src/polap.sh assemble --test

If you see a screen output ending with the something like this:

NEXT: polap prepare-polishing [-a s1.fq] [-b s2.fq]

You may be ready to use Polap. Continued from the previous command lines,

(polap) $ conda deactivate
(base) $ conda activate polap-fmlrc
(polap-fmlrc) $ ../src/polap.sh prepare-polishing
(polap-fmlrc) $ ../src/polap.sh polish

Your final mitochondrial genome sequence is mt.1.fa.

Using Polap

Once you passed the test dataset run with polap, you could execute polap two different ways depending on how you installed it. If you installed polap using Bioconda's polap package, you could simply type in polap at the prompt as long as you have the polap conda environment activated.

(polap) $ polap

Note that you have (polap) at your prompt. If you installed polap using the github source, then it would be easier to git-clone Polap at your data folder. You still must have the polap conda environment activated.

(polap) $ git clone https://github.com/goshng/polap.git
(polap) $ src/polap.sh

Let me simply use polap command by assuming that you installed polap using Bioconda's polap package. You would replace polap with the git-clone and src/polap.sh commands. If you are an experienced command line user, this introduction would be easy. If not, try to follow this README.

Input data

Now, you create a folder and place an input dataset in it: a single Oxford Nanopore long-read fastq file, and two short-read fastq files. The test folder from the git-clone source folder has a set of such fastq files. The case of a single short-read file is not tested, yet. If no input-files are specified by options with fastq input file paths, three input files must be at the current directory where you execute the polap script: l.fq, s1.fq, and s2.fq. The three fastq files must be extracted if they are compressed. You could use options such as -l, -a, and -b to set your fastq files. Please, execute polap without any option to see a help message.

Whole-genome assemby

Now that you have the three fastq files, the first polap step is to run a whole-genome assembly using Flye with menu assemble1. The first argument of the command polap is called menu. I do not know other better name for the first argument. A menu has its help message with a second argument of help. You would see options for the menu assemble1 with the following command:

(polap) $ polap assemble1 help

Assuming that you have the three fastq files: l.fq, s1.fq, and s2.fq, you could execute a whole-genome assembly step with the following command:

(polap) $ polap assemble1

It would take some times depending on your dataset. Please, be patient.

Plant organelle annotation

After the execution of the whole-genome assembly using Flye, you should be able to open o/0/30-contigger/graph_final.gfa using Bandage to view the genome assembly graph. The graph would display contigs, some of which originate from mitochondrial or plastid DNAs and most from nuclear DNA molecules. You could look through the graph to locate contigs from plant mitochondrial DNAs. If you could do so, then you could use the contigs for plant mitochondrial DNA candidates. If not, you continue the polap pipeline by annotating your initial whole-genome assembly.

(polap) $ polap annotate

This takes some time as well. So, be patient.

Organelle contig selection

After the execution of the polap annotation step, you should be able to see contigs with organelle gene annotations in o/0/contig-annotation-table.txt. There are two more similar text files: o/0/assembly_info_organelle_annotation_count-all.txt and o/0/assembly_info_organelle_annotation_count.txt. After inspecting the three files, choose contigs with more MT genes than PT genes in the assembly graph. Admittedly, this choice of contigs is subjective step. I might need another section to describe what one could do in this contig selection step. Let's assume that you have contig condidates that potentially originate from plant mitochondrial DNAs. The three text file above are a modified version of o/0/30-contigger/contigs_stats.txt output file from Flye. The file is a table with rows for contigs assembled from the whole-genome assembly. Each contig may consist of one or more edge sequences. The Bandage graph shows edge sequences rather than contig sequences. So choose candidate contigs using the Bandage graph and the three annotation table text file.

Now, you need to prepare a text file, o/0/mt.contig.name-1, in your folder. Add one edge sequence name per line. Note that the edge sequence names should start with edge_ not contig_. An example file for a MT contig name looks like this:

edge_1
edge_2
edge_3

For more information on how you could prepare a MT contig file, see MT contig name below.

Organelle-genome assembly

Now, you are ready to run an organelle-genome assembly.

(polap) $ polap assemble2

Wait for a while. This second genome assembly using Flye produces an assembly graph file called o/1/30-contigger/graph_final.gfa. Use Bandage to view the organelle-genome assembly graph.

Polishing

We hope that this organelle-genome assembly graph is simpler than that of the whole-genome assembly. If so, you would want to finish your organelle-genome assembly upto the polishing stage of Flye.

(polap) $ polap flye-polishing

Now, extract a mitochondrial genome sequence by opening o/1/assembly_graph.gfa using Bandage. Save the sequence with a file name called mt.0.fasta. This is the long-read mitochondrial genome assembly in fasta format.

We use the approach ptGAUL, which uses FMLRC to polish long-read assembly sequences using short-read data. The next step is just a wrapper of the script recommended by ptGAUL.

(polap) $ conda activate polap-fmlrc
(polap-fmlrc) $ src/polap.sh prepare-polishing
(polap-fmlrc) $ src/polap.sh polish

Your final mitochondrial genome sequence is mt.1.fa.

Uninstalling Polap

(polap) $ conda deactivate
(base) $ conda remove -n polap --all
(base) $ conda remove -n polap-fmlrc --all

Sample datasets

An example data set used to demonstrate the utility of Polap may be accessed at a figshare website: Polap data analysis of 11 datasets.

MT contig name

An MT contig name file should be edited to list contig names. You select contigs that you suspect might have been from mitochondrial DNAs. You may use the following for the three features of contigs.

  1. genome assembly graph,
  2. presence of organelle gene,
  3. copy numbers of contigs.

Note that contigs are named by edge suffixed with a number. You should use Bandage to view whole-genome assembly graph: o/0/30-contigger/graph_final.gfa. You could see organelle annotation and copy numbers in

$ column -t o/0/30-contigger/contigs_stats.txt

You could watch YouTube (in Korean) to see how you could select contigs that might have been from mitochondrial DNAs. The following is an example of the Flye assembly table with organelle gene counts. In the table, the column copy is the copy number for a contig, MT is the number of mitochondrial genes, and PT is the number of plastid genes. Edge is edeg numbers that constitute the contig at the corresponding line.

$ column -t o/assembly_info_organelle_annotation_count.txt
Contig Length V3 V4 V5 Copy V7 V8 MT PT Edge
contig_3 70078 6 N N 1 both * 19 41 1,3,-1
contig_1 39736 6 N Y 1 left * 16 16 1
contig_2 19223 3 N N 1 both * 11 13 2

Now that you have seen the table, edit o/0/mt.contig.name-1 to add lines of mtDNA contig candidates.

edge_1
edge_2
edge_3

You must have no empty lines in o/0/mt.contig.name-1. Once you have your mt.contig.name-1 file ready, then: Your next command is:

(polap) $ polap assemble2

Your draft assembly graph is o/1/30-contigger/graph_final.gfa

(polap) $ polap flye-polishing

Your long-read polished assembly graph is o/1/assembly_graph.gfa You could extract a draft organelle genome sequence from the assembly graph by using Bandage. You could watch YouTube (in Korean) to see how you could extract a DNA sequence through a circular path.

Options

Menu options

assemble1

: Flye whole-genome assembly

annotate

: Organelle gene annotation

assemble2

: Flye organelle-genome assembly

flye-polishing

: Flye assembly long-read polishing

prepare-polishing

: FMLRC short-read polishing preparation

polish

: FMLRC short-read polishing

General options

-o DIRECTORY, --outdir DIRECTORY

: Write output files to DIRECTORY. The default output directory is o.

-l FILE, --long-reads FILE

: Specify the long-read FASTQ file. If this option is not specified, the default long-read file will be used. It will be the l.fq at the current directory. The FASTQ file must be a regular file not compressed one.

-a FILE, --short-read1 FILE

: Specify the first short-read FASTQ file. If this option is not specified, the default long-read file will be used. It will be the s1.fq at the current directory. The FASTQ file must be a regular file not compressed one. Two short-read data files are required.

-b FILE, --short-read2 FILE

: Specify the second short-read FASTQ file. If this option is not specified, the default 1st short-read file will be used. It will be the s2.fq at the current directory. The FASTQ file must be a regular file not compressed one. Two short-read data files are required.

-p FILE, --unpolished-fasta FILE

: Specify the unpolished FASTA file. If this option is not specified, the default 2nd short-read file will be used. It will be the mt.0.fasta at the current directory.

-f FILE, --final-assembly FILE

: Specify the final genome assembly FASTA file. If this option is not specified, the default long-read file will be used. It will be the mt.1.fa at the current directory.

-m NUMBER, --min-read-length NUMBER

: Specify the minimum length of long-read sequences. If this option is not specified, the default minimum length will be used. It will be the 3000.

-t NUMBER, --threads NUMBER

: Specify the numbre CPU cores to use (default: 0).

-c NUMBER, --coverage NUMBER

: Specify coverage for the 2nd assembly (default: 30).

-r NUMBER, --pair-min NUMBER

: Specify the minimum mapped bases or PAF 11th column (default: 3000).

-x NUMBER, --bridge-min NUMBER

: Specify the minimum bridging read length or PAF 7th column (default: 3000).

-w NUMBER, --single-min NUMBER

: Specify the minimum mapped bases or PAF 11th column (default: 3000).

-i NUMBER, --inum NUMBER

: Specify the previous output number of organelle-genome assembly (default: 0).

-j NUMBER, --jnum NUMBER

: Specify the current output number of organelle-genome assembly (default: 1).

-v, --version

: Print version.

-h, --help

: Show usage message.

Authors

Copyright 2024 Sungshin Women's University. Released under the GNU General Public License version 3 This software carries no warranty of any kind.

About

Polap is a plant mtDNA assembly pipeline using Flye long-read assembler guided by organelle gene annotation.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published