Nextflow pipeline to extend reference annotation with nanopore reads, classify novel genes (mRNAs vs lncRNAs).

IGDRion/ANNEXA

ANNEXA: Analysis of Nanopore with Nextflow for EXtended Annotation

Introduction

ANNEXA is an all-in-one reproducible pipeline, written in Nextflow, which allows users to analyze LR-RNAseq (Long-Read RNA-seq) data and to reconstruct and quantify known and novel genes and isoforms.

Pipeline summary

Metro map

ANNEXA needs only three inputs (a reference genome, a reference annotation and mapping files) and provides users with an extended annotation that distinguishes novel protein-coding (mRNA) genes from novel long non-coding RNA (lncRNA) genes. All known and novel gene/transcript models are further characterized through multiple features (length, number of spliced transcripts, normalized expression levels, etc.) available as graphical outputs.

  1. Check if the input annotation contains all the information needed.
  2. Transcriptome reconstruction and quantification with bambu.
  3. Classification of novel genes with FEELnc.
  4. Retrieve information from the input annotation and format the final GTF with a three-level structure: gene -> transcript -> exon.
  5. Filter novel transcripts based on bambu NDR (Novel Discovery Rate) and/or TransforKmers TSS validation to assess full-length transcripts.
  6. Perform a quality control of both the full and filtered extended annotations (see example).
  7. Optional: Check gene body coverage with RSeQC.

This pipeline has been tested with reference annotation from Ensembl and NCBI-RefSeq.

Usage

  1. Install Nextflow

  2. Test the pipeline on a small dataset

nextflow run IGDRion/ANNEXA \
    -profile test,singularity

  3. Run ANNEXA on your own data (replace the values of --input, --gtf and --fa with the paths to your files).

nextflow run IGDRion/ANNEXA \
    -profile {test,docker,singularity,conda,slurm} \
    --input samples.txt \
    --gtf /path/to/ref.gtf \
    --fa /path/to/ref.fa

The --input parameter takes a text file listing the paths of the BAM files to analyze, one per line (see the example below):

/path/to/1.bam
/path/to/2.bam
/path/to/3.bam
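
When the BAM files sit together in one directory, the list can be generated instead of typed by hand. The helper below is a minimal sketch and not part of ANNEXA: the function name `list_bams` is hypothetical, and it assumes the files end in `.bam` and are located directly in the given directory.

```shell
# list_bams DIR: print the absolute path of every *.bam file found
# directly in DIR, one per line, sorted so the order is reproducible.
# (Hypothetical helper, not shipped with ANNEXA.)
list_bams() {
    # The subshell keeps the caller's working directory unchanged;
    # after cd, $PWD holds the absolute path of DIR.
    ( cd "$1" && find "$PWD" -maxdepth 1 -type f -name '*.bam' | sort )
}
```

For example, `list_bams /path/to/bams > samples.txt` would write a file in the format shown above, ready to pass to --input.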

Options

Required:
--input             : Path to file listing paths to bam files.
--fa                : Path to reference genome.
--gtf               : Path to reference annotation.


Optional:
-profile test       : Run ANNEXA on a toy dataset.
-profile slurm      : Run ANNEXA with the slurm executor.
-profile singularity: Run ANNEXA in a singularity container.
-profile conda      : Run ANNEXA in a conda environment.
-profile docker     : Run ANNEXA in a docker container.

--filter            : Whether to perform the filtering step. false by default.
--tfkmers_tokenizer : Path to TransforKmers tokenizer. Required if the filter is activated.
--tfkmers_model     : Path to TransforKmers model. Required if the filter is activated.
--bambu_threshold   : bambu NDR threshold below which new transcripts are retained.
--tfkmers_threshold : TransforKmers prediction threshold below which new transcripts are retained.
--operation         : Operation used to retain novel transcripts. "union" retains transcripts validated by either bambu or TransforKmers, "intersection" retains transcripts validated by both.

--withGeneCoverage  : Run RSeQC (can be long depending on annotation and bam sizes). False by default.

--maxCpu            : max cpu threads used by ANNEXA. 8 by default.
--maxMemory         : max memory used by ANNEXA. 40GB by default.

If the filter argument is set to true, the TransforKmers model and tokenizer paths must be provided. They can either be downloaded from the official TransforKmers repository or trained in advance on your own data.
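
Since a filtered run fails without the model and tokenizer, it can help to check both paths before launching. The wrapper below is a sketch under assumptions: the function name `run_annexa_filtered` is hypothetical, the singularity profile and the input paths are placeholders, and --filter is passed as a bare flag (use `--filter true` if your setup expects an explicit value). The thresholds and the operation are the defaults stated in this README.

```shell
# run_annexa_filtered MODEL_DIR TOKENIZER_DIR
# Hypothetical wrapper: verifies that both TransforKmers paths exist,
# then launches ANNEXA with the filtering step enabled.
run_annexa_filtered() {
    model=$1
    tokenizer=$2
    for path in "$model" "$tokenizer"; do
        if [ ! -e "$path" ]; then
            echo "TransforKmers path not found: $path" >&2
            return 1
        fi
    done
    # Input paths below are placeholders; replace them with your own files.
    nextflow run IGDRion/ANNEXA \
        -profile singularity \
        --input samples.txt \
        --gtf /path/to/ref.gtf \
        --fa /path/to/ref.fa \
        --filter \
        --tfkmers_model "$model" \
        --tfkmers_tokenizer "$tokenizer" \
        --bambu_threshold 0.2 \
        --tfkmers_threshold 0.2 \
        --operation intersection
}
```

It would be called as `run_annexa_filtered /path/to/model /path/to/tokenizer`, with the two paths pointing to wherever the model and tokenizer were downloaded or trained.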

Filtering step

When the filtering step is activated (--filter), ANNEXA can filter the generated extended annotation using two methods:

  1. By using the NDR (Novel Discovery Rate) computed by bambu. This score combines several kinds of information, such as sequence profile, structure (mono-exonic, etc.) and quantification (number of samples, expression). Each transcript with an NDR below the classification threshold is retained by ANNEXA (default: 0.2).

  2. By analysing the Transcription Start Site (TSS) of each new transcript with TransforKmers, a deep-learning-based tool. Each TSS validated below a certain threshold is retained (default: 0.2). We already provide 2 trained models for filtering TSS with TransforKmers.

To use them, extract the zip, and point --tfkmers_model and --tfkmers_tokenizer to the subdirectories.

The filtered annotation can be the union of the transcripts retained by these two methods, i.e. all transcripts validated by at least one of them, or their intersection, i.e. only the transcripts validated by both (the latter being the default).
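
The difference between the two operations can be illustrated on a small table. This is only a sketch of the set logic, not ANNEXA's implementation: the function name `select_transcripts` and the input format are assumptions, namely a tab-separated file with a transcript ID followed by two 0/1 flags saying whether bambu and TransforKmers validated the transcript.

```shell
# select_transcripts OPERATION FILE
# FILE (hypothetical format): transcript_id <TAB> bambu_ok <TAB> tfkmers_ok
# Prints the IDs of the transcripts kept under "union" or "intersection".
select_transcripts() {
    case "$1" in
        union|intersection) ;;
        *) echo "unknown operation: $1" >&2; return 2 ;;
    esac
    awk -F '\t' -v op="$1" '
        # union: keep a transcript when at least one tool validated it
        op == "union"        && ($2 == 1 || $3 == 1) { print $1 }
        # intersection: keep it only when both tools validated it
        op == "intersection" && ($2 == 1 && $3 == 1) { print $1 }
    ' "$2"
}
```

With a transcript validated by bambu only, `select_transcripts union` keeps it while `select_transcripts intersection` drops it, which is why the intersection yields the more conservative annotation.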

At the end, the QC steps are performed both on the full and filtered extended annotations.