Move variant discovery to its own subcommand #234

mbhall88 · 2020-09-26T06:35:30Z

I found myself needing to expose more and more de novo parameters at the CLI and realised it was all cluttering map. There could also a slightly misleading perception that using --discover --genotype in map would genotype with the discovered variants (which it doesn't). Anyway, I forsee - when we have self-container de novo variant integration - the need to have this subcommand output a new PRG with variants added, or even another subcommand augment.

Added

New subcommand pandora discover

$ pandora discover --help
Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample and discover novel variants.
Usage: ./pandora discover [OPTIONS] <TARGET> <QUERY>

Positionals:
  <TARGET> FILE [required]    An indexed PRG file (in fasta format)
  <QUERY> FILE [required]     Fast{a,q} file containing reads to quasi-map

Options:
  -h,--help                   Print this help message and exit
  --discover-k INT            K-mer size to use when discovering novel variants [default: 11]
  --max-ins INT               Max. insertion size for novel variants. Warning: setting too long may impair performance [default: 15]
  --covg-threshold INT        Positions with coverage less than this will be tagged for variant discovery [default: 3]
  -l INT                      Min. length of consecutive positions below coverage threshold to trigger variant discovery [default: 1]
  -L INT                      Max. length of consecutive positions below coverage threshold to trigger variant discovery [default: 50]
  -P,--pad INT                Padding either side of candidate variant intervals [default: 22]
  -d,--merge INT              Merge candidate variant intervals within distance [default: 22]
  --min-dbg-dp INT            Minimum node/kmer depth in the de Bruijn graph used for discovering variants [default: 2]
  -v                          Verbosity of logging. Repeat for increased verbosity

Indexing:
  -w INT                      Window size for (w,k)-minimizers (must be <=k) [default: 14]
  -k INT                      K-mer size for (w,k)-minimizers [default: 15]

Input/Output:
  -o,--outdir DIR             Directory to write output files to [default: pandora_discover]
  -t,--threads INT            Maximum number of threads to use [default: 1]
  --kg                        Save kmer graphs with forward and reverse coverage annotations for found loci
  -M,--mapped-reads           Save a fasta file for each loci containing read parts which overlapped it

Parameter Estimation:
  -e,--error-rate FLOAT       Estimated error rate for reads [default: 0.11]
  -g,--genome-size STR/INT    Estimated length of the genome - used for coverage estimation. Can pass string such as 4.4m, 100k etc. [default: 5000000]
  --bin                       Use binomial model for kmer coverages [default: negative binomial]

Mapping:
  -m,--max-diff INT           Maximum distance (bp) between consecutive hits within a cluster [default: 250]
  -c,--min-cluster-size INT   Minimum size of a cluster of hits between a read and a loci to consider the loci present [default: 10]

Preset:
  -I,--illumina               Reads are from Illumina. Alters error rate used and adjusts for shorter reads

Filtering:
  --clean                     Add a step to clean and detangle the pangraph
  --max-covg INT              Maximum coverage of reads to accept [default: 600]

Consensus/Variant Calling:
  --kmer-avg INT              Maximum number of kmers to average over when selecting the maximum likelihood path [default: 100]

function to merge candidate regions within a certain distance [closes Merge candidate regions #228] and -d,--merge to specify the distance at CLI
Interval::is_close method
--min-dbg-dp to expose minimum adbundance in GATB graph
-l and -L for min/max candidate region length
-P,--pad to specify candidate region padding

Changed

allow -g,--genome-size to take a string such as 4.4m [closes Allow passing genome size string #233]
some logging cleanup/level changes

Removed

--coverages option from map as it wasn't being used

mbhall88 · 2020-09-29T05:13:39Z

I have (successfully) run the tip of this branch on multiple samples.

leoisl

Just two minor comments, very nice PR, I think this was an essential refactoring, thanks for taking the time to do this

include/denovo_discovery/candidate_region.h

src/interval.cpp

mbhall88 · 2020-10-13T05:51:54Z

As a way of validating this is working - in tandem with mbhall88/head_to_head_pipeline#54 - here is a picture of the precision and recall for the version of pandora in this PR

fixed_prgs is the data from this PR

Recall

Precision

Note: fixed_prgs is unfiltered here

mbhall88 added 6 commits September 26, 2020 16:22

logging changes

2f5baaf

merge candidate regions [iqbal-lab-org#228]

a9508a5

reduce min kmer abundance for GATB

40a22f3

expose denovo params for insert size and dbg covg

616d388

allow genome size to take a string [iqbal-lab-org#233]

db68603

move discover routine into its own subcommand

95f5fb3

mbhall88 marked this pull request as ready for review September 29, 2020 05:12

mbhall88 requested a review from leoisl September 29, 2020 05:12

mbhall88 mentioned this pull request Sep 29, 2020

Merge candidate regions #228

Closed

leoisl approved these changes Oct 8, 2020

View reviewed changes

include/denovo_discovery/candidate_region.h Outdated Show resolved Hide resolved

src/interval.cpp Outdated Show resolved Hide resolved

mbhall88 added 2 commits October 9, 2020 15:28

add names to header definition of Discover constructor

4ca9c8d

add is_close method for Interval

e1ead06

mbhall88 requested a review from leoisl October 9, 2020 21:40

mbhall88 mentioned this pull request Oct 13, 2020

Investigate low recall for pandora variant calls mbhall88/head_to_head_pipeline#53

Closed

leoisl approved these changes Oct 13, 2020

View reviewed changes

mbhall88 merged commit f1d0ede into iqbal-lab-org:dev Oct 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move variant discovery to its own subcommand #234

Move variant discovery to its own subcommand #234

mbhall88 commented Sep 26, 2020 •

edited

Loading

mbhall88 commented Sep 29, 2020

leoisl left a comment

mbhall88 commented Oct 13, 2020

Move variant discovery to its own subcommand #234

Move variant discovery to its own subcommand #234

Conversation

mbhall88 commented Sep 26, 2020 • edited Loading

Added

Changed

Removed

mbhall88 commented Sep 29, 2020

leoisl left a comment

Choose a reason for hiding this comment

mbhall88 commented Oct 13, 2020

Recall

Precision

mbhall88 commented Sep 26, 2020 •

edited

Loading