Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move variant discovery to its own subcommand #234

Merged
merged 8 commits into from
Oct 13, 2020

Conversation

mbhall88
Copy link
Member

@mbhall88 mbhall88 commented Sep 26, 2020

I found myself needing to expose more and more de novo parameters at the CLI and realised it was all cluttering map. There could also a slightly misleading perception that using --discover --genotype in map would genotype with the discovered variants (which it doesn't). Anyway, I forsee - when we have self-container de novo variant integration - the need to have this subcommand output a new PRG with variants added, or even another subcommand augment.

Added

  • New subcommand pandora discover
$ pandora discover --help
Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample and discover novel variants.
Usage: ./pandora discover [OPTIONS] <TARGET> <QUERY>

Positionals:
  <TARGET> FILE [required]    An indexed PRG file (in fasta format)
  <QUERY> FILE [required]     Fast{a,q} file containing reads to quasi-map

Options:
  -h,--help                   Print this help message and exit
  --discover-k INT            K-mer size to use when discovering novel variants [default: 11]
  --max-ins INT               Max. insertion size for novel variants. Warning: setting too long may impair performance [default: 15]
  --covg-threshold INT        Positions with coverage less than this will be tagged for variant discovery [default: 3]
  -l INT                      Min. length of consecutive positions below coverage threshold to trigger variant discovery [default: 1]
  -L INT                      Max. length of consecutive positions below coverage threshold to trigger variant discovery [default: 50]
  -P,--pad INT                Padding either side of candidate variant intervals [default: 22]
  -d,--merge INT              Merge candidate variant intervals within distance [default: 22]
  --min-dbg-dp INT            Minimum node/kmer depth in the de Bruijn graph used for discovering variants [default: 2]
  -v                          Verbosity of logging. Repeat for increased verbosity

Indexing:
  -w INT                      Window size for (w,k)-minimizers (must be <=k) [default: 14]
  -k INT                      K-mer size for (w,k)-minimizers [default: 15]

Input/Output:
  -o,--outdir DIR             Directory to write output files to [default: pandora_discover]
  -t,--threads INT            Maximum number of threads to use [default: 1]
  --kg                        Save kmer graphs with forward and reverse coverage annotations for found loci
  -M,--mapped-reads           Save a fasta file for each loci containing read parts which overlapped it

Parameter Estimation:
  -e,--error-rate FLOAT       Estimated error rate for reads [default: 0.11]
  -g,--genome-size STR/INT    Estimated length of the genome - used for coverage estimation. Can pass string such as 4.4m, 100k etc. [default: 5000000]
  --bin                       Use binomial model for kmer coverages [default: negative binomial]

Mapping:
  -m,--max-diff INT           Maximum distance (bp) between consecutive hits within a cluster [default: 250]
  -c,--min-cluster-size INT   Minimum size of a cluster of hits between a read and a loci to consider the loci present [default: 10]

Preset:
  -I,--illumina               Reads are from Illumina. Alters error rate used and adjusts for shorter reads

Filtering:
  --clean                     Add a step to clean and detangle the pangraph
  --max-covg INT              Maximum coverage of reads to accept [default: 600]

Consensus/Variant Calling:
  --kmer-avg INT              Maximum number of kmers to average over when selecting the maximum likelihood path [default: 100]
  • function to merge candidate regions within a certain distance [closes Merge candidate regions #228] and -d,--merge to specify the distance at CLI
  • Interval::is_close method
  • --min-dbg-dp to expose minimum adbundance in GATB graph
  • -l and -L for min/max candidate region length
  • -P,--pad to specify candidate region padding

Changed

Removed

  • --coverages option from map as it wasn't being used

@mbhall88 mbhall88 marked this pull request as ready for review September 29, 2020 05:12
@mbhall88
Copy link
Member Author

I have (successfully) run the tip of this branch on multiple samples.

@mbhall88 mbhall88 mentioned this pull request Sep 29, 2020
Copy link
Collaborator

@leoisl leoisl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two minor comments, very nice PR, I think this was an essential refactoring, thanks for taking the time to do this

include/denovo_discovery/candidate_region.h Outdated Show resolved Hide resolved
src/interval.cpp Outdated Show resolved Hide resolved
@mbhall88
Copy link
Member Author

As a way of validating this is working - in tandem with mbhall88/head_to_head_pipeline#54 - here is a picture of the precision and recall for the version of pandora in this PR

fixed_prgs is the data from this PR

Recall

image

Precision

Note: fixed_prgs is unfiltered here

image

@mbhall88 mbhall88 merged commit f1d0ede into iqbal-lab-org:dev Oct 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants