A SARS-CoV-2 bioinformatics pipeline for CLIA validation using Dr. Erin Young's Cecret StaphB pipeline as a base

CDCgov/SC2CLIA

SC2CLIA

A SARS-CoV-2 Nextflow pipeline for Clinical Laboratory Improvement Amendments (CLIA) compliant variant calling and spike protein substitution prediction, with additional tools for quality control, CLIA-ready reports, and consensus sequence uploads to NCBI.

Description

SC2CLIA Cecret is a CLIA compliance ready SARS-CoV-2 analysis workflow developed by bioinformaticians from the Enteric Diseases Laboratory Branch at the Centers for Disease Control and Prevention, with assistance from CDC's Respiratory Viruses Branch. It adds CDC-specific QA/QC metrics, CLIA-ready reports, database storage stubs, and consensus sequence uploads for NCBI to the Cecret SARS-CoV-2 workflow developed by Dr. Erin Young at the Utah Public Health Laboratories.

The SC2CLIA Cecret pipeline is designed to analyze SARS-CoV-2 sequencing data produced by the ARTIC/Illumina hybrid library prep workflow on MiSeq, with protocols here and here.

Requirements

  1. Python 3 or higher. Download Python here.

  2. Nextflow version 20 or later. Download Nextflow here.

  3. Singularity version 3.7 is recommended. Run singularity --version in your terminal to check the installed version.
    Warning: version 3.5 does not work.

    Warning: Singularity uses the default tmp directory for temporary storage, so that directory needs enough free space. You may want to set SINGULARITY_TMPDIR to a directory that has enough space.

  4. Cecret workflow installed. Read more about Cecret here.
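
The requirements above can be checked before installing. The following is a minimal sketch, not part of SC2CLIA: the function names are made up for illustration, it assumes GNU sort (for the -V option), and it treats 3.7 as the minimum Singularity version even though the list above only says that 3.7 is recommended and 3.5 does not work.

```shell
#!/usr/bin/env bash
# Pre-flight check for the tools listed under Requirements.
# Illustrative sketch only; these helpers are not part of SC2CLIA.

# version_ge A B: succeeds when version A >= version B (needs GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# first_version TEXT: print the first dotted version number found in TEXT.
first_version() {
  printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# check_tool NAME MIN_VERSION VERSION_FLAG: report whether NAME is usable.
check_tool() {
  local name="$1" min="$2" flag="$3" found
  if ! command -v "$name" >/dev/null 2>&1; then
    echo "MISSING  $name (need >= $min)"
    return 1
  fi
  found=$(first_version "$("$name" "$flag" 2>&1)")
  if [ -n "$found" ] && version_ge "$found" "$min"; then
    echo "OK       $name $found"
  else
    echo "TOO OLD  $name ${found:-unknown} (need >= $min)"
    return 1
  fi
}

check_tool python3 3.0 --version || true
check_tool nextflow 20.0 -v || true
check_tool singularity 3.7 --version || true   # 3.7 as minimum is an assumption

# Point Singularity at a location with enough free space.
# /scratch/$USER is a placeholder path; adjust it for your system.
# export SINGULARITY_TMPDIR=/scratch/$USER/singularity_tmp
```

Each tool prints one status line, and the script keeps going when a tool is missing so that all problems are listed in one pass.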

Install

  1. Clone the GitHub repository to a folder
    git clone https://github.com/CDCgov/SC2CLIA.git

  2. Obtain the R Singularity container by downloading or building your own copy
    The image URL is in the settings.ini file. If you just want to use the default version, the pipeline will download the R Singularity container automatically

  3. Update the configuration files in the configs folder with your own custom paths - see Configs folder below

[WARNING]

  • To run the pipeline as is, set MP= to the correct path at lines #2103 and #2135 of Cecret/Cecret_alltools.nf. MP is the mount point for the R container; we usually set it to the top-level directory.

  • The report process is turned off. To run it, prepare your own versions of report.pdf and report.tex, put them under Cecret/configs/internal/, and set params.report = true at line #74 of Cecret/Cecret_alltools.nf.

  • The Kraken2 and bbmap processes are turned off. To run them, fill in the correct paths for ‘kraken2_db’ and ‘bbmap’ in Cecret/configs/internal/singularity.config, and set params.bbmap and params.kraken2 to true there as well.
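
Line numbers such as #2103 can shift between versions of the repository, so it may be easier to find these settings by name. The sketch below is an illustration only: show_setting is a hypothetical helper, and the search patterns are guesses based on the setting names mentioned in the warnings above.

```shell
#!/usr/bin/env bash
# Locate the settings named in the warnings by searching for them,
# instead of relying on fixed line numbers. Illustrative sketch only.

# show_setting FILE PATTERN: print matching lines prefixed with line numbers.
show_setting() {
  local file="$1" pattern="$2"
  if [ ! -f "$file" ]; then
    echo "not found: $file" >&2
    return 1
  fi
  grep -nE "$pattern" "$file" || echo "no match for '$pattern' in $file"
}

# Run from the top of the cloned repository.
show_setting Cecret/Cecret_alltools.nf 'MP=' || true
show_setting Cecret/Cecret_alltools.nf 'params\.report' || true
show_setting Cecret/configs/internal/singularity.config 'kraken2_db|bbmap|kraken2' || true
```

The output lists each match as line_number:content, which gives the current location of every value that needs editing.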

Usage

  1. Run the following script from your base folder (replace data_folder with the path to your data; r is for generating report files)
    ./run_cecret.sh -d data_folder
    (the optional flag -p applies a different profile from the config file; the default is v3)
    (the optional flag -b turns on the bbmap process, which maps filtered reads to the human genome GRCh38)
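
To make the flag combinations concrete, here is a hypothetical sketch of how a wrapper script could parse -d, -p, and -b with getopts. It is not the contents of run_cecret.sh. It assumes that -p takes a profile name as its argument and that -b is a plain on/off switch; the variable names are invented for the example.

```shell
#!/usr/bin/env bash
# Hypothetical flag parsing for a wrapper like run_cecret.sh.
# NOT the real script: -p taking an argument and -b being a switch
# are assumptions based on the usage notes.

parse_flags() {
  DATA_DIR=""        # required: folder holding the fastq files
  PROFILE="v3"       # default profile named in the usage notes
  RUN_BBMAP="false"  # bbmap stays off unless -b is given
  local opt OPTIND=1
  while getopts "d:p:b" opt; do
    case "$opt" in
      d) DATA_DIR="$OPTARG" ;;
      p) PROFILE="$OPTARG" ;;
      b) RUN_BBMAP="true" ;;
      *) return 2 ;;
    esac
  done
  if [ -z "$DATA_DIR" ]; then
    echo "usage: -d data_folder [-p profile] [-b]" >&2
    return 2
  fi
}

# Example: default profile with bbmap left off.
parse_flags -d data_folder && echo "data=$DATA_DIR profile=$PROFILE bbmap=$RUN_BBMAP"
```

Under these assumptions, a full invocation would look like ./run_cecret.sh -d data_folder -p v3 -b.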

Main Components

Original Cecret Nextflow processes include:

  • seqyclean - for cleaning reads
  • fastp - for cleaning reads; optional, faster alternative to seqyclean
  • bwa - for aligning reads to the reference
  • minimap2 - an alternative to bwa
  • ivar - calling variants and creating a consensus fasta; optional primer trimmer
  • samtools - for QC metrics and sorting; optional primer trimmer; optional converting bam to fastq files
  • fastqc - for QC metrics
  • bedtools - for depth estimation over amplicons
  • kraken2 - for read classification
  • pangolin - for lineage classification
  • nextclade - for clade classification
  • mafft - for multiple sequence alignment (optional, relatedness must be set to "true")
  • snp-dists - for relatedness determination (optional, relatedness must be set to "true")
  • iqtree - for phylogenetic tree generation (optional, relatedness must be set to "true")
  • bamsnap - to create images of SNPs
  • filter - to filter out human DNA from the reads

SC2CLIA Cecret Nextflow processes include:

  • vadr - for annotating fastas like NCBI (different than Erin's version)
  • pacbam_amplicons - for characterization of genomic regions and single nucleotide positions (for amplicons)
  • pacbam_orfs - for characterization of genomic regions and single nucleotide positions (for ORFs)
  • nextcladeParse - for parsing nextclade csv file and generating aa change stats
  • ivar_vcf - for converting ivar_variants tsv file into standard vcf file
  • coverage_depth (bwa,samtools) - for calculating average read coverage over non-N consensus positions
  • sc2ref - for calculating percentage of reads passing QC that align to reference
  • ncbi_upload - for NCBI GenBank submission
  • mqc - for generating MultiQC report
  • largest_indel - for calculating largest INDEL length
  • ampliconstats_dropout - for generating amplicon dropout stats
  • bbmap - for mapping filtered reads to human genome GRCh38
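
Two of the metrics in this list are simple enough to sketch with awk. The functions below are illustrations of the ideas behind largest_indel and coverage_depth, not the pipeline's own code. They assume a plain-text VCF with explicit REF and ALT alleles, a single-record consensus FASTA, and a three-column depth table (name, position, depth) such as the one samtools depth produces for reads mapped to that consensus.

```shell
#!/usr/bin/env bash
# Illustrative metric sketches; not the SC2CLIA implementations.

# largest_indel VCF: print the largest length difference between REF and
# any ALT allele. Symbolic alleles such as <DEL> are not handled.
largest_indel() {
  awk -F '\t' '
    /^#/ { next }                          # skip header lines
    {
      n = split($5, alts, ",")             # ALT may hold several alleles
      for (i = 1; i <= n; i++) {
        d = length(alts[i]) - length($4)   # positive = insertion
        if (d < 0) d = -d                  # negative = deletion
        if (d > max) max = d
      }
    }
    END { print max + 0 }
  ' "$1"
}

# mean_depth_non_n CONSENSUS_FASTA DEPTH_TSV: print the average depth over
# consensus positions whose base is not N.
mean_depth_non_n() {
  awk '
    NR == FNR { if ($0 !~ /^>/) seq = seq toupper($0); next }  # read FASTA
    $2 >= 1 && $2 <= length(seq) && substr(seq, $2, 1) != "N" {
      sum += $3; n++
    }
    END { if (n) printf "%.2f\n", sum / n; else print "NA" }
  ' "$1" "$2"
}
```

With hypothetical file names, usage would be largest_indel sample.vcf and mean_depth_non_n sample.consensus.fa sample.depth.tsv.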

Additional SC2CLIA R and Python scripts:

Configs folder

The configs folder contains reference files, configuration files, and static information files.

  • reference files - see Reference documents below
  • containers_fixedversion_hash.config - a central place for holding container information for all processes
  • author_template.csv - used by ncbi_upload module
  • submission_template.csv - used by ncbi_upload module
  • internal folder - We use this folder to hold any path/domain specific files/variables.
    • settings.ini - general purpose configuration file for the CDC SC2CLIA pipeline
    • singularity.config - custom configuration file for the CDC CLIA version of Cecret
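
Before a first run, it can help to confirm that the files listed above are in place. This is a minimal sketch with a hypothetical helper, missing_configs. It assumes the configs folder lives at Cecret/configs (as the paths in the warnings suggest) and that the files sit at the relative locations shown in the list.

```shell
#!/usr/bin/env bash
# Report which of the configuration files listed above are absent.
# Illustrative sketch; the folder layout is an assumption.

# missing_configs [ROOT]: print each missing file; fail if any is missing.
missing_configs() {
  local root="${1:-Cecret/configs}" rel status=0
  for rel in \
    containers_fixedversion_hash.config \
    author_template.csv \
    submission_template.csv \
    internal/settings.ini \
    internal/singularity.config
  do
    if [ ! -f "$root/$rel" ]; then
      echo "missing: $root/$rel"
      status=1
    fi
  done
  return "$status"
}

# Run from the top of the cloned repository.
missing_configs Cecret/configs || true
```

No output means every listed file was found.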

Reference documents

Note

  1. Running the above script will generate a folder 'Run__' with all the resulting analysis, output, and QC files/folders in it

  2. The fastq data for all samples in an analysis should be in a single input data_folder.
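
Since all fastq files are expected in one folder, a quick check of the input folder can catch files that ended up in subfolders. The sketch below is an illustration only: the helper names are invented, and the file extensions (.fastq, .fastq.gz, .fq, .fq.gz) are an assumption about how the data is named.

```shell
#!/usr/bin/env bash
# Check that fastq files sit directly in the input folder.
# Illustrative sketch; the extensions are an assumption.

# count_fastq DIR [find options]: count fastq-like files under DIR.
count_fastq() {
  local dir="$1"
  shift
  find "$dir" "$@" \
    \( -name '*.fastq' -o -name '*.fastq.gz' -o -name '*.fq' -o -name '*.fq.gz' \) \
    | wc -l | tr -d ' '
}

# check_data_folder DIR: report the file count, warn about nested files.
check_data_folder() {
  local dir="$1" top nested
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 1
  fi
  top=$(count_fastq "$dir" -maxdepth 1)
  nested=$(count_fastq "$dir" -mindepth 2)
  if [ "$nested" -gt 0 ]; then
    echo "warning: $nested fastq file(s) in subfolders of $dir" >&2
  fi
  if [ "$top" -eq 0 ]; then
    echo "no fastq files directly in $dir" >&2
    return 1
  fi
  echo "$top fastq file(s) in $dir"
}

# Replace data_folder with the path to your data.
check_data_folder data_folder || true
```

Warnings go to standard error, so the count on standard output can be captured on its own.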

Contributing

Future Plans

We may containerize this pipeline in the future.

Resources

Cecret

Notices

Public Domain Notice

This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.

Privacy Notice

This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.

Contributing Notice

Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.

All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.

Records Management Notice

This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.
