Collections of scripts used for the paper PopDel identifies medium-size deletions simultaneously in tens of thousands of genomes

kehrlab/PopDel-scripts

PopDel-scripts

Collection of scripts used for the paper: Sebastian Niehus, Hákon Jónsson, Janina Schönberger, Eythór Björnsson, Doruk Beyter, Hannes P. Eggertsson, Patrick Sulem, Kári Stefánsson, Bjarni V. Halldórsson, Birte Kehr. PopDel identifies medium-size deletions jointly in tens of thousands of genomes.
Available at Nature Communications: https://www.nature.com/articles/s41467-020-20850-5
All generated output VCF/BCF files are published at Zenodo; DOI: https://doi.org/10.5281/zenodo.3992607

Content

This repository contains the scripts used for the comparisons and evaluations of the paper PopDel calls deletions jointly in tens of thousands of genomes. It does not contain the public data of the Polaris HiSeqX Diversity Cohort, the Polaris Kids Cohort or the Illumina Platinum Genome NA12878. These data sets can be obtained from the respective online sources. Before running the scripts, you need to adapt the paths in some of them.

Random Deletion Data Simulation

Simulation/uniform_simulation/deletion_data_Simulation/uniform_simulation/

Note on random seeds: We used 0 as the random seed to simulate 2000 deletions with the script 'simulate_deletions.txt'. The random seed used for the simulation of reads with 'simulate.sh' was 1 for the first sample, 2 for the second sample and so on.
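
The seeding scheme described above can be sketched as follows. This is only an illustration and not the code in 'simulate_deletions.txt' or 'simulate.sh': the function names, the genome length and the use of Python's random module are assumptions made for the example.

```python
import random

DELETION_SEED = 0  # seed stated above for the deletion simulation


def read_simulation_seed(sample_index):
    """Seed for the 1-based sample index: 1 for the first sample, 2 for the second, ..."""
    if sample_index < 1:
        raise ValueError("sample_index is 1-based")
    return sample_index


def draw_deletion_starts(n_deletions, genome_length, seed=DELETION_SEED):
    """Draw reproducible start positions (hypothetical stand-in for the real simulator)."""
    rng = random.Random(seed)
    return sorted(rng.randrange(genome_length) for _ in range(n_deletions))


first = draw_deletion_starts(2000, 1_000_000)
second = draw_deletion_starts(2000, 1_000_000)
# Same seed, same positions: rerunning the simulation reproduces the variants.
print(first == second)
print([read_simulation_seed(i) for i in (1, 2, 3)])
```

Fixing one seed for the variants and a distinct seed per sample keeps the deletion set identical across runs while the reads of each sample still differ.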

Simulation/uniform_simulation/delly/

  • generateDelly.sh: Generates the scripts for each batch size.
  • runDelly.sh: Runs the scripts generated by generateDelly.sh and measures the resource consumption.

Simulation/uniform_simulation/gridss/

  • generateGridss.sh: Generates the scripts for each batch size.
  • runGridss.sh: Calls all the scripts generated by generateGridss.sh and measures the resource consumption.
  • gridss.filter.sh: Deduplicates the break ends called by GRIDSS and annotates the VCFs.
  • annotate.R: Annotation script called by gridss.filter.sh.

Simulation/uniform_simulation/smoove/

Simulation/uniform_simulation/podel/

  • popdelProfile.sh: Contains the commands for creating a profile of each bam file.
  • runPopDel.sh: Runs all the scripts and measures the resource consumption.

Simulation/uniform_simulation/truth/ Contains an archive (truth.tar.gz) of all simulated variants for each batch size used for the evaluation. The files are the results of the deletion data simulation with the above-mentioned random seeds.

Simulation/uniform_simulation/plots/

  • eval_bed.sh: Evaluates the TP/FP/FN of all tools. Note that the range of the evaluation loop might have to be adjusted to match the batch sizes that the respective tools actually processed successfully.
  • compare_results.py: Script for an alternative evaluation, based on fixed positional and size-estimate margins. Can also consider the genotypes of individual samples.
  • eval.sh: Wrapper for compare_results.py
  • simulation_plots.R: Script for generating all plots of the simulated data.
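
An evaluation based on fixed positional and size margins, as mentioned for compare_results.py, could look roughly like the sketch below. It does not reproduce the actual script: the `Deletion` class, the function names, the greedy one-to-one matching and the default margins of 100 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deletion:
    chrom: str
    start: int
    length: int


def matches(call, truth, pos_margin, size_margin):
    """A call matches a true deletion if position and size both lie within the margins."""
    return (call.chrom == truth.chrom
            and abs(call.start - truth.start) <= pos_margin
            and abs(call.length - truth.length) <= size_margin)


def evaluate(calls, truths, pos_margin=100, size_margin=100):
    """Return (TP, FP, FN); each true deletion is matched by at most one call."""
    unmatched = list(truths)
    tp = fp = 0
    for call in calls:
        hit = next((t for t in unmatched
                    if matches(call, t, pos_margin, size_margin)), None)
        if hit is None:
            fp += 1
        else:
            unmatched.remove(hit)
            tp += 1
    return tp, fp, len(unmatched)


truths = [Deletion("chr21", 1000, 500), Deletion("chr21", 9000, 2000)]
calls = [Deletion("chr21", 1040, 480), Deletion("chr21", 5000, 700)]
print(evaluate(calls, truths))  # (1, 1, 1)
```

Greedy matching is the simplest choice here; a stricter evaluation might pick the closest true deletion for each call instead of the first one within the margins.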

G1k Deletion Data Simulation

Simulation/uniform_simulation/deletion_data_Simulation/g1k_simulation/

Call Set Comparison of Different Variant Callers on HG002 Trio

Ashkenazim_trio

Ashkenazim_trio

Ashkenazim_trio

  • Snakefile_GIAB_popdel: Snakefile to be used with Snakemake to manage PopDel's workflow.
  • config.yml: Configuration for the corresponding Snakefile.
  • filter_manta.sh: Filters PopDel's call sets.
  • sampling.regions: Contains the regions PopDel uses for sampling the background distribution. If the option "-r grch37" is used, this file is not required and the sampling is performed on the same regions.
  • maxCov.tsv: File containing the maximum coverage that PopDel should consider for each of the three genomes. The selected values correspond to three times each sample's mean coverage.
  • all.GRCh37.profiles: File containing the paths to the profiles of the three genomes.
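
The rule behind maxCov.tsv (three times the mean coverage of each sample) amounts to a one-line calculation. In the sketch below, the sample names and mean coverage values are placeholders and not the values of the trio, and the tab-separated output is only an assumed layout, not necessarily that of maxCov.tsv.

```python
def max_coverage(mean_coverage, factor=3):
    """Maximum coverage to consider: factor times the sample's mean coverage."""
    return factor * mean_coverage


# Placeholder samples and coverages, for illustration only.
mean_cov = {"sample_A": 30.0, "sample_B": 45.5, "sample_C": 50.0}

for sample, cov in mean_cov.items():
    # Assumed layout: sample name, tab, maximum coverage.
    print(f"{sample}\t{max_coverage(cov)}")
```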

Ashkenazim_trio

  • mendelianError: Contains scripts for calculation and plotting of the Mendelian inheritance error.
  • precision_recall: Contains scripts for calculation and plotting of the precision-recall curves.
  • venn_diagram: Contains scripts for calculation and plotting of the Venn diagrams.
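
For readers unfamiliar with the Mendelian inheritance error, the following sketch shows one common way to compute it for biallelic variants in a trio. It is an assumption-laden illustration rather than the code in mendelianError: genotypes are encoded as the number of deletion alleles (0, 1 or 2), all function names are hypothetical, and the real scripts may exclude or weight sites differently.

```python
def transmissible(genotype):
    """Alleles (0 = reference, 1 = deletion) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_consistent(father, mother, child):
    """True if the child's genotype can be formed from one allele of each parent."""
    return any(f + m == child
               for f in transmissible(father)
               for m in transmissible(mother))


def mendelian_error_rate(trios):
    """Fraction of (father, mother, child) genotype triples violating Mendelian inheritance."""
    trios = list(trios)
    if not trios:
        raise ValueError("need at least one trio genotype")
    errors = sum(not is_mendelian_consistent(*t) for t in trios)
    return errors / len(trios)


genotypes = [(0, 0, 0), (0, 1, 1), (0, 0, 1), (2, 2, 2)]
print(mendelian_error_rate(genotypes))  # 0.25
```

In this example only the third triple is inconsistent, because a child cannot carry a deletion allele that neither parent has.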

Call Set Comparison of Different Variant Callers on NA12878

NA12878/delly/

NA12878/smoove/

NA12878/popdel/

  • profile.sh: Contains the command used for generating the profile of NA12878.
  • platinum.profiles: Contains the location of the profile generated by the above command. Used as input for PopDel call.
  • Snakefile_call.popdel.platinum: Snakefile to be used with Snakemake to manage the PopDel call commands for each chromosome.

NA12878/reference

NA12878/plots

NA12878/plots/bedtools/

  • Snakefile.intersect: Snakefile for use with Snakemake. Manages the BED-conversion and overlap calculations via bedtools intersect.
  • config.yaml: Configuration of the evaluation.
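
The overlap calculation that bedtools intersect performs on BED intervals can be pictured with the small sketch below. It is not part of the repository: the function names are hypothetical, and whether the evaluation requires a minimum overlap fraction (bedtools intersect offers one through its -f option) is not stated here.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of bases shared by two BED intervals (0-based, half-open)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def overlap_fraction(a_start, a_end, b_start, b_end):
    """Fraction of interval A that is covered by interval B."""
    length = a_end - a_start
    if length <= 0:
        raise ValueError("interval A must have positive length")
    return overlap_length(a_start, a_end, b_start, b_end) / length


print(overlap_length(100, 200, 150, 300))    # 50
print(overlap_fraction(100, 200, 150, 300))  # 0.5
```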

Polaris HiSeqX Diversity Cohort

polaris_diversity_cohort/delly/

polaris_diversity_cohort/smoove/

polaris_diversity_cohort/popdelProfile/

polaris_diversity_cohort/popdelCall/

  • Snakefile_rnd_150: Snakefile for use with Snakemake. Applies PopDel call on each chromosome of all samples jointly.

polaris_diversity_cohort/plots/

Polaris Kids Cohort

polaris_kids_cohort/delly/

polaris_kids_cohort/smoove/

polaris_kids_cohort/popdel/

polaris_kids_cohort/plots/

  • hwe.py: Filters the VCF files according to the Hardy-Weinberg equilibrium.
  • transmission_ntrio.py: Calculates the transmission rates and Mendelian inheritance error rates on the HWE-filtered VCFs.
  • kids.ped: Contains the pedigree information of the kids cohort. One trio <Parent1, Parent2, Child> per line.
  • tr-mendel.R: Script for creating the transmission rate plots and plots of Mendelian inheritance error.
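
A Hardy-Weinberg filter of the kind hwe.py applies can be illustrated with a chi-square test on genotype counts. The sketch below is an assumption, not the script itself: the test statistic, the function names and the idea of thresholding the p-value are chosen for illustration, and the actual test and cutoff used by hwe.py are not described here.

```python
import math


def hwe_chi_square(n_ref_ref, n_het, n_del_del):
    """Chi-square statistic (1 degree of freedom) for deviation from Hardy-Weinberg equilibrium."""
    n = n_ref_ref + n_het + n_del_del
    if n == 0:
        raise ValueError("no genotypes given")
    p = (2 * n_ref_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p                            # deletion allele frequency
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_ref, n_het, n_del_del)
    # Monomorphic sites have expected counts of zero; those terms are skipped.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def hwe_p_value(n_ref_ref, n_het, n_del_del):
    """P-value of the statistic under a chi-square distribution with 1 degree of freedom."""
    chi2 = hwe_chi_square(n_ref_ref, n_het, n_del_del)
    return math.erfc(math.sqrt(chi2 / 2.0))


print(hwe_p_value(25, 50, 25))  # counts in perfect equilibrium
print(hwe_p_value(50, 0, 50))   # no heterozygotes at all: strong deviation
```

A variant would then be kept if its p-value stays above a chosen significance level. For small cohorts or rare variants, an exact test is often preferred over the chi-square approximation.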
