Sateesh Peri edited this page Mar 13, 2022 · 3 revisions

Welcome to the mycosnp-nf wiki!

nf-core/mycosnp is a bioinformatics best-practice analysis pipeline. MycoSNP is a portable workflow for performing whole genome sequencing analysis of fungal organisms, including Candida auris. The workflow prepares the reference, performs quality control, and calls variants against that reference. MycoSNP generates several output files that are compatible with downstream analytic tools, such as those used for phylogenetic tree-building and gene variant annotation.

Pipeline summary

Reference Preparation

Prepares a reference FASTA file for BWA alignment and GATK variant calling by masking repeats in the reference and generating the BWA index.

  • Genome repeat identification and masking (nucmer)
  • BWA index generation (bwa)
  • FAI and DICT file creation (Picard, Samtools)
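
To illustrate what repeat masking does to the reference, here is a minimal Python sketch. It assumes the repeat regions have already been identified and are available as 1-based, inclusive (start, end) coordinates. The function name `mask_intervals` and the toy sequence are hypothetical and are not taken from the pipeline's own scripts.

```python
def mask_intervals(sequence, intervals, mask_char="N"):
    """Return `sequence` with every 1-based inclusive interval replaced by `mask_char`.

    Hypothetical helper for illustration only; not the pipeline's implementation.
    """
    bases = list(sequence)
    for start, end in intervals:
        # Tolerate intervals given in reverse order (start > end).
        lo, hi = min(start, end), max(start, end)
        # Clamp to the sequence bounds so stray coordinates cannot raise.
        lo = max(lo, 1)
        hi = min(hi, len(bases))
        for i in range(lo - 1, hi):
            bases[i] = mask_char
    return "".join(bases)


# Toy example: mask positions 3-5 and 8-9 of a 10-base sequence.
masked = mask_intervals("ACGTACGTAC", [(3, 5), (9, 8)])
print(masked)
```

Masked positions are replaced rather than deleted, so reference coordinates stay unchanged for alignment and variant calling.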

Sample QC and Processing

Prepares samples (paired-end FASTQ files) for GATK variant calling by aligning the samples to a BWA reference index and ensuring that the BAM files are correctly formatted. This step also provides different quality reports for sample evaluation.

  • Combine FASTQ files across lanes if a sample was sequenced on multiple lanes.
  • Filter unpaired reads from FASTQ files (SeqKit).
  • Downsample FASTQ files to a desired coverage or sampling rate (SeqTK).
  • Trim reads and assess quality (FaQCs).
  • Generate a QC report from the FaQCs output.
  • Align FASTQ reads to a reference (BWA).
  • Sort BAM files (SAMTools).
  • Mark and remove duplicates in the BAM file (Picard).
  • Clean the BAM file (Picard "CleanSam").
  • Fix mate information in the BAM file (Picard "FixMateInformation").
  • Add read groups to the BAM file (Picard "AddOrReplaceReadGroups").
  • Index the BAM file (SAMTools).
  • FastQC - Filtered reads QC.
  • Qualimap mapping quality report.
  • MultiQC - Aggregate report describing results and QC from the whole pipeline.
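
The relationship between a desired coverage and a sampling rate in the downsampling step can be sketched as follows. This is an assumption about how such a rate is commonly derived (target coverage divided by current coverage), not a description of the pipeline's actual code. The function name `sampling_rate` and the numbers are illustrative only.

```python
def sampling_rate(total_bases, genome_size, target_coverage):
    """Fraction of reads to keep so that mean coverage is about `target_coverage`.

    Hypothetical helper; assumes coverage = total sequenced bases / genome size.
    """
    if total_bases <= 0 or genome_size <= 0:
        raise ValueError("total_bases and genome_size must be positive")
    current_coverage = total_bases / genome_size
    if current_coverage <= target_coverage:
        # Already at or below the target: keep every read.
        return 1.0
    return target_coverage / current_coverage


# Illustrative numbers: 1.24 Gb of reads over a 12.4 Mb genome is 100x coverage,
# so reaching 70x means keeping about 70% of the reads.
rate = sampling_rate(total_bases=1_240_000_000, genome_size=12_400_000, target_coverage=70)
print(rate)
```

The resulting fraction is the kind of value a read subsampling tool would take as its sampling rate.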

Variant calling and analysis

Calls variants and generates a multi-FASTA file and phylogeny.

  • Call variants (GATK HaplotypeCaller).
  • Combine gVCF files from the HaplotypeCaller into a single VCF (GATK CombineGVCFs).
  • Call genotypes (GATK GenotypeGVCFs).
  • Filter the variants (GATK VariantFiltration) [default (but customizable) filter: 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || DP < 10'].
  • Run a customized VCF filtering script (Broad Institute).
  • Split the filtered VCF file by sample.
  • Select only SNPs from the VCF files (GATK SelectVariants).
  • Split the VCF file with SNPs by sample.
  • Create a consensus sequence for each sample (BCFTools, SeqTK).
  • Create a multi-FASTA file from the VCF SNP positions using a custom script (Broad).
  • Create phylogeny from multi-FASTA file (rapidNJ, FastTree2, RAxML, IQ-TREE).
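
To make the default filter expression listed above concrete, the following Python sketch evaluates 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || DP < 10' against a variant's annotations. It is an illustration only: GATK evaluates these expressions itself, and this toy parser handles only `NAME < value` and `NAME > value` clauses joined by `||`. The function names and example records are hypothetical, and treating a missing annotation as not matching is an assumption of this sketch.

```python
import operator

# Only the two comparison operators used by the default expression are supported.
OPS = {"<": operator.lt, ">": operator.gt}

DEFAULT_FILTER = "QD < 2.0 || FS > 60.0 || MQ < 40.0 || DP < 10"


def parse_filter(expression):
    """Split an 'A < x || B > y' expression into (name, comparison, threshold) clauses."""
    clauses = []
    for clause in expression.split("||"):
        name, op, value = clause.split()
        clauses.append((name, OPS[op], float(value)))
    return clauses


def fails_filter(info, clauses):
    """Return True if any clause matches, meaning the variant would be flagged.

    Annotations missing from `info` are treated as not matching (sketch assumption).
    """
    return any(
        name in info and compare(info[name], threshold)
        for name, compare, threshold in clauses
    )


clauses = parse_filter(DEFAULT_FILTER)

# Hypothetical annotation values for two variants.
good = {"QD": 25.1, "FS": 0.0, "MQ": 60.0, "DP": 48}
low_depth = {"QD": 25.1, "FS": 0.0, "MQ": 60.0, "DP": 6}

print(fails_filter(good, clauses))       # no clause matches
print(fails_filter(low_depth, clauses))  # DP < 10 matches
```

Because the clauses are joined with `||`, a variant is flagged as soon as any single threshold is violated; it does not need to fail all four.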