SC2CLIA

A SARS-CoV-2 Nextflow pipeline for Clinical Laboratory Improvement Amendments (CLIA) compliant variant calling and spike protein substitution prediction, with additional tools for quality control, CLIA-ready reports, and consensus sequence uploads to NCBI.

Description

SC2CLIA Cecret is a CLIA-compliance-ready SARS-CoV-2 analysis workflow developed by bioinformaticians from the Enteric Diseases Laboratory Branch at the Centers for Disease Control and Prevention, with assistance from CDC's Respiratory Viruses Branch. It extends the Cecret SARS-CoV-2 workflow, developed by Dr. Erin Young at the Utah Public Health Laboratory, with CDC-specific QA/QC metrics, CLIA-ready reports, database storage stubs, and consensus sequence uploads to NCBI.

The SC2CLIA Cecret pipeline is designed to analyze SARS-CoV-2 MiSeq sequencing data prepared with the ARTIC/Illumina hybrid library prep workflow. The protocols are available here and here.

Requirements

Python 3 or higher. Download Python here.
Nextflow version 20 or higher. Download Nextflow here.
Singularity version 3.7 is recommended. Run singularity --version in your terminal to check the installed version.
Warning: Singularity version 3.5 does not work.

Warning: Singularity uses the default tmp directory for temporary storage, so that directory must have enough free space. You may want to set SINGULARITY_TMPDIR to a directory that has enough space.
The Cecret workflow, installed. Read more about Cecret here.
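
The requirement checks above can be scripted before a first run. The following is a minimal sketch and is not part of the SC2CLIA repository: the helper names check_tool and use_singularity_tmp are hypothetical, and the scratch directory is whatever path you choose.

```shell
# Sketch only: these helper names are hypothetical, not part of SC2CLIA.

# Print whether a required command is on PATH; return non-zero if it is missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Point Singularity's temporary storage at a directory you choose,
# creating the directory if it does not exist yet.
use_singularity_tmp() {
    mkdir -p "$1" || return 1
    export SINGULARITY_TMPDIR="$1"
    echo "SINGULARITY_TMPDIR=$SINGULARITY_TMPDIR"
}
```

After sourcing these functions into your shell, call check_tool once each for python3, nextflow, and singularity, run singularity --version to confirm the version is not 3.5, and then call use_singularity_tmp with a directory of your choice. Running df -h on that directory shows how much free space it has.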

Install

Clone the GitHub repository to a folder:
git clone https://github.com/CDCgov/SC2CLIA.git
Obtain the R Singularity container by downloading it or building your own copy.
The image URL is in the settings.ini file. If you just want to use the default version, the pipeline downloads the R Singularity container automatically.
Update the configuration files in the configs folder with your own custom paths (see Configs folder below).

[WARNING]

To run the pipeline as is, set MP= to the correct path in Cecret/Cecret_alltools.nf, lines #2103 and #2135. This path is the mount point for the R container; we usually set it to the top-level directory.
The report process is turned off. To run it, prepare your own versions of report.pdf and report.tex, put them under Cecret/configs/internal/, and set params.report = true in Cecret/Cecret_alltools.nf, line #74.
The Kraken2 and bbmap processes are turned off. To run them, fill in the correct paths for 'kraken2_db' and 'bbmap' in Cecret/configs/internal/singularity.config, and set params.bbmap and params.kraken2 to true in the same file.
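
Line numbers drift as a file changes, so it can help to search for the settings named above instead of relying on lines #2103, #2135, and #74. The sketch below uses only grep. The function name show_setting is hypothetical; the file paths in the example calls are the ones given in the warnings.

```shell
# show_setting FILE PATTERN
# Print every line of FILE that contains PATTERN (matched as a fixed string),
# prefixed with its line number.
show_setting() {
    if [ ! -f "$1" ]; then
        echo "not found: $1" >&2
        return 2
    fi
    grep -n -F -- "$2" "$1"
}

# Example calls, run from the repository root:
#   show_setting Cecret/Cecret_alltools.nf 'MP='
#   show_setting Cecret/Cecret_alltools.nf 'params.report'
#   show_setting Cecret/configs/internal/singularity.config 'kraken2'
#   show_setting Cecret/configs/internal/singularity.config 'bbmap'
```

Each match is printed as line-number:line-content, which shows where to edit without assuming the line numbers are still current.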

Usage

Run the following script from your base folder (replace data_folder with the path to your data; r is for generating report files):
./run_cecret.sh -d data_folder
(The optional flag -p applies a different profile from the config file; the default is v3.)
(The optional flag -b turns on the bbmap process, which maps filtered reads to the human genome GRCh38.)
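
As a sketch of how the flags combine, the wrapper below checks that the data folder exists before calling run_cecret.sh. The function name run_sc2clia and the RUN_CECRET variable are hypothetical. It also assumes that -p takes the profile name as its argument and that -b takes no argument; both are inferred from the flag descriptions above, not confirmed against the script.

```shell
# Path to the launcher; override RUN_CECRET if you are not in the base folder.
RUN_CECRET="${RUN_CECRET:-./run_cecret.sh}"

# run_sc2clia DATA_FOLDER [PROFILE] [bbmap]
# Build the argument list for run_cecret.sh and launch it.
run_sc2clia() {
    data_folder="$1"
    profile="${2:-}"
    use_bbmap="${3:-}"

    if [ ! -d "$data_folder" ]; then
        echo "data folder not found: $data_folder" >&2
        return 1
    fi

    set -- -d "$data_folder"                  # required: input data folder
    if [ -n "$profile" ]; then
        set -- "$@" -p "$profile"             # assumed: -p takes the profile name
    fi
    if [ "$use_bbmap" = "bbmap" ]; then
        set -- "$@" -b                        # assumed: -b is a bare switch
    fi

    "$RUN_CECRET" "$@"
}
```

For example, run_sc2clia data_folder runs with the default profile, and run_sc2clia data_folder v3 bbmap names the v3 profile explicitly and turns on the bbmap process.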

Main Components

Original Cecret Nextflow processes include:

seqyclean - for cleaning reads
fastp - for cleaning reads; optional, faster alternative to seqyclean
bwa - for aligning reads to the reference
minimap2 - an alternative to bwa
ivar - for calling variants and creating a consensus fasta; optional primer trimmer
samtools - for QC metrics and sorting; optional primer trimmer; optional converting bam to fastq files
fastqc - for QC metrics
bedtools - for depth estimation over amplicons
kraken2 - for read classification
pangolin - for lineage classification
nextclade - for clade classification
mafft - for multiple sequence alignment (optional, relatedness must be set to "true")
snp-dists - for relatedness determination (optional, relatedness must be set to "true")
iqtree - for phylogenetic tree generation (optional, relatedness must be set to "true")
bamsnap - to create images of SNPs
filter - to filter out human DNA from the reads

SC2CLIA Cecret Nextflow processes include:

vadr - for annotating fasta files the way NCBI does (differs from the version in the original Cecret workflow)
pacbam_amplicons - for characterization of genomic regions and single nucleotide positions (for amplicons)
pacbam_orfs - for characterization of genomic regions and single nucleotide positions (for ORFs)
nextcladeParse - for parsing the nextclade csv file and generating amino acid change statistics
ivar_vcf - for converting ivar_variants tsv file into standard vcf file
coverage_depth (bwa,samtools) - for calculating average read coverage over non-N consensus positions
sc2ref - for calculating percentage of reads passing QC that align to reference
ncbi_upload - for NCBI GenBank submission
mqc - for generating MultiQC report
largest_indel - for calculating largest INDEL length
ampliconstats_dropout - for generating amplicon dropout statistics
bbmap - for mapping filtered reads to human genome GRCh38
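
To illustrate the kind of calculation a step such as largest_indel performs, the sketch below reports the length of the longest insertion or deletion in an iVar variants TSV file. It is not the pipeline's implementation. It assumes the usual iVar output layout, in which the fourth column (ALT) writes an insertion as + followed by the inserted bases and a deletion as - followed by the deleted bases. The function name largest_indel_length is hypothetical.

```shell
# largest_indel_length VARIANTS.tsv
# Print the length of the longest indel in an iVar variants TSV (0 if none).
largest_indel_length() {
    awk -F '\t' '
        NR == 1 { next }                  # skip the header row
        $4 ~ /^[+-]/ {                    # ALT starts with + or -: an indel
            len = length($4) - 1          # drop the leading sign
            if (len > max) max = len
        }
        END { print max + 0 }             # max is unset when no indel was seen
    ' "$1"
}
```

A file that contains only substitutions prints 0, which makes the result easy to compare against a QC threshold.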

Additional SC2CLIA R and Python scripts:

Custom R Singularity container definition file - building and using the custom Singularity container used to execute all R scripts in the pipeline
ORF statistics calculation scripts (R versions) - using and understanding the R scripts calculating ORF quality statistics for the pipeline
Reports - configuring and interpreting the different reports generated by the pipeline

Configs folder

The configs folder contains some important reference files, configuration files and some static information files.

reference files - see Reference documents below
containers_fixedversion_hash.config - a central place for holding container information for all processes
author_template.csv - used by the ncbi_upload module
submission_template.csv - used by the ncbi_upload module
internal folder - holds any path-specific or domain-specific files and variables
- settings.ini - general purpose configuration file for the CDC SC2CLIA pipeline
- singularity.config - custom configuration file for the CDC CLIA version of Cecret

Reference documents

artic_V3_nCoV-2019.bed: ARTIC V3 primer scheme. Source.
artic_v3_nCoV-2019.insert.even.bed and artic_v3_nCoV-2019.insert.odd.bed: ARTIC V3 amplicon locations, split across two BED files so that even-numbered and odd-numbered amplicons are in different files. Source.
MN908947.3-ORF7b.bed and MN908947.3-ORFs.bed: Open reading frame annotations for SARS-CoV-2. The source was converted to a BED file, which was then split to avoid overlapping annotations in a single file.
MN908947.3.gff: Open reading frame annotations for SARS-CoV-2. Source.
MN908947.3.fasta: SARS-CoV-2 reference genome sequence. Source.
MN908947.3.fasta.fai: Fasta index file.
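
The even/odd split described above can be reproduced with a few lines of awk. This is a sketch under an assumption about the input: that the fourth column of the insert BED file identifies the amplicon and ends in its number (for example 1, or a name ending in _1). The function name split_bed_by_parity and the output file names are hypothetical, and this is not necessarily how the files in the configs folder were produced.

```shell
# split_bed_by_parity INPUT.bed EVEN_OUT.bed ODD_OUT.bed
# Write lines for even-numbered amplicons to EVEN_OUT and odd-numbered to ODD_OUT.
split_bed_by_parity() {
    awk -F '\t' -v even="$2" -v odd="$3" '
        {
            n = $4
            sub(/^.*[^0-9]/, "", n)       # keep only the trailing digits
            if (n == "") next             # no trailing amplicon number: skip
            if (n % 2 == 0) print > even
            else            print > odd
        }
    ' "$1"
}
```

Lines whose fourth column does not end in a number are skipped and not guessed at, so a mismatch with the assumed layout shows up as missing rows in the output.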

Note

Running the above script generates a folder 'Run__' that contains all the resulting analysis, output, and QC files and folders.
The fastq data for all samples in an analysis should be in a single input data_folder.
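
A quick pre-run check can confirm that the fastq files sit directly in the data folder and not in subfolders. The sketch below assumes the files are named *.fastq or *.fastq.gz, which the text above does not state, and the function name check_data_folder is hypothetical.

```shell
# check_data_folder DATA_FOLDER
# Count fastq files at the top level of DATA_FOLDER and in its subfolders.
# Succeeds only if at least one file is at the top level and none are nested.
check_data_folder() {
    top=$(find "$1" -maxdepth 1 -type f \
            \( -name '*.fastq' -o -name '*.fastq.gz' \) | wc -l | tr -d ' ')
    nested=$(find "$1" -mindepth 2 -type f \
            \( -name '*.fastq' -o -name '*.fastq.gz' \) | wc -l | tr -d ' ')
    echo "top-level fastq files: $top"
    echo "fastq files in subfolders: $nested"
    [ "$top" -gt 0 ] && [ "$nested" -eq 0 ]
}
```

A non-zero exit status means the folder is either empty of fastq files or has some tucked away in subfolders, where a run that expects a single flat folder may miss them.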

Contributing

Future Plans

We may containerize this pipeline in the future.

Resources

Cecret

Notices

Public Domain Notice

This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.

Privacy Notice

This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.

Contributing Notice

Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.

All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.

Records Management Notice

This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.
