WEHIGenomicsRnD/duplex-seq-pipeline

Duplex Sequencing Pipeline

A snakemake pipeline for obtaining duplex consensus reads from data generated with a duplex sequencing protocol (e.g. NanoSeq). The pipeline is similar to the workflow recommended by IDT for processing sequence data with unique molecular identifiers (UMIs). QC is performed both before and after duplex consensus calling.

Installation

The only prerequisite is snakemake. To install snakemake, you will need to install a Conda-based Python3 distribution. For this, Mambaforge is recommended. Once mamba is installed, snakemake can be installed like so:

mamba create -c conda-forge -c bioconda -n snakemake snakemake

Now activate the snakemake environment (you'll have to do this every time you want to run the pipeline):

conda activate snakemake

Now clone the repository:

git clone https://github.com/WEHIGenomicsRnD/duplex-seq-pipeline.git
cd duplex-seq-pipeline

Configuration

The pipeline configuration file is config/config.yaml, and the configuration file for FastQ Screen is config/fastq_screen.conf. Please go through these settings carefully. The main settings to consider are:

  • read_structure -- ensure that this matches the UMI design of your experiment. Refer to fgbio's ExtractUmisFromBam for details on how to set this parameter.
  • umis -- if your UMIs are known (non-random), you can add a path to a UMIs text file (one UMI per line). Specifying a file path for this parameter will trigger a UMI correction step.
  • ref -- ensure you have downloaded and specified the correct reference for your data.
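
If you supply a UMIs text file, a quick sanity check before running can catch formatting problems early. The following is a minimal sketch, not part of the pipeline: the function name check_umi_file is hypothetical, and it assumes that every UMI consists of uppercase A/C/G/T only and that all UMIs have the same length. Adjust these checks if your UMI design differs.

```shell
# Sketch: sanity-check a UMI list before pointing the `umis` setting at it.
# Assumptions (not taken from the pipeline): one UMI per line, uppercase
# A/C/G/T only, and all UMIs the same length.
# Usage: check_umi_file path/to/umis.txt
check_umi_file() {
    local umi_file="$1"

    # The file must exist and be non-empty.
    if [ ! -s "$umi_file" ]; then
        echo "missing or empty: $umi_file" >&2
        return 1
    fi

    # Reject any line that is not purely A/C/G/T (this also catches blank
    # lines and Windows line endings).
    if grep -qvE '^[ACGT]+$' "$umi_file"; then
        echo "unexpected characters or blank lines in $umi_file" >&2
        return 1
    fi

    # Count the number of distinct line lengths; expect exactly one.
    local n_lengths
    n_lengths=$(awk '{ seen[length($0)] = 1 } END { n = 0; for (k in seen) n++; print n }' "$umi_file")
    if [ "$n_lengths" -ne 1 ]; then
        echo "UMIs of differing lengths in $umi_file" >&2
        return 1
    fi

    return 0
}
```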

Running

Run the pipeline as follows:

conda activate snakemake
snakemake --use-conda --conda-frontend mamba --cores 1
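
Before a full run, it can be useful to preview the jobs snakemake plans to execute. The sketch below wraps the command above with snakemake's standard --dry-run flag (short form -n), which lists the planned jobs without running them. The wrapper name preview_pipeline is hypothetical and only for illustration.

```shell
# Sketch: list the jobs snakemake would run, without executing any of them.
# --dry-run is a standard snakemake option; preview_pipeline is a
# hypothetical helper name, not part of this repository.
# Usage: preview_pipeline [CORES]   (defaults to 1 core)
preview_pipeline() {
    local cores="${1:-1}"
    snakemake --dry-run --use-conda --conda-frontend mamba --cores "$cores"
}
```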

If you want to submit your jobs to the cluster using SLURM, use the following to run the pipeline:

conda activate snakemake
snakemake --use-conda --conda-frontend mamba --profile slurm --jobs 8 --cores 24

The pipeline will generate all results under a results directory. The most relevant directories are:

  • results/QC/multiQC -- contains the MultiQC report for the reads before consensus calling, along with FastQC and FastQ Screen metrics on the raw reads.
  • results/QC/consensus/multiQC -- contains the MultiQC report for the reads after duplex consensus calling.
  • results/consensus/{sample}__mapped_merged_filtered_clipped.bam -- contains the mapped, filtered and clipped consensus reads. These should be your "final" read alignments for duplex consensus reads.
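
Once a run finishes, you may want to confirm that every sample produced its final consensus BAM. The following is a minimal sketch based on the file naming pattern above. The function name missing_consensus_bams is hypothetical, and the sample names are whatever you pass in; it prints each sample whose BAM is missing or empty and returns non-zero if any are.

```shell
# Sketch: report samples that lack a final consensus BAM.
# Relies only on the naming pattern
#   results/consensus/{sample}__mapped_merged_filtered_clipped.bam
# missing_consensus_bams is a hypothetical helper, not part of the pipeline.
# Usage: missing_consensus_bams results sampleA sampleB ...
missing_consensus_bams() {
    local results_dir="$1"
    shift
    local sample bam status=0
    for sample in "$@"; do
        bam="$results_dir/consensus/${sample}__mapped_merged_filtered_clipped.bam"
        # -s is true only if the file exists and is non-empty.
        if [ ! -s "$bam" ]; then
            echo "$sample"
            status=1
        fi
    done
    return "$status"
}
```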
