Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement and improve CLI #224

Merged
merged 27 commits into from
Sep 26, 2020
Merged

Reimplement and improve CLI #224

merged 27 commits into from
Sep 26, 2020

Conversation

mbhall88
Copy link
Member

@mbhall88 mbhall88 commented Sep 1, 2020

This PR moves the handling of the Command-line interface (CLI) to CLI11 and closes #53.

Additionally, some changes in this PR are not backwards compatible as some options have been renamed/removed.

This PR also contains a hotfix that closes #226

@mbhall88
Copy link
Member Author

mbhall88 commented Sep 1, 2020

Index

Old

$ pandora index -h
Usage: pandora index [options] <prgs.fa>
Options:
        -h,--help                       Show this help message
        -w W                            Window size for (w,k)-minimizers, default 14
        -k K                            K-mer size for (w,k)-minimizers, default 15
        -t T                            Number of threads, default 1
        --offset                                Offset for PRG ids, default 0
        --outfile                               Filename for index
        --log_level                     debug,[info],warning,error

New

$ pandora index -h
Index population reference graph (PRG) sequences.
Usage: pandora index [OPTIONS] <PRG>

Positionals:
  <PRG> FILE [required]       PRG to index (in fasta format)

Options:
  -h,--help                   Print this help message and exit
  -w INT                      Window size for (w,k)-minimizers (must be <=k) [default: 14]
  -k INT                      K-mer size for (w,k)-minimizers [default: 15]
  -t,--threads INT            Maximum number of threads to use [default: 1]
  -o,--outfile FILE           Filename for the index [default: <PRG>.k15.w14.idx]
  -v                          Verbosity of logging. Repeat for increased verbosity

The only point of discussion here I think (which is the same across all subcommands) is the change to the log level indication. Instead of asking the user to state the level explicitly (i.e. info, debug etc.) it is assumed as info and if they want more logging they pass -v to indicate debug and -vv to indicate trace.
Note: this change also addresses and closes #160

Another nice thing about using CLI11 is there are some nice built-in validators. For this subcomand, if<PRG> does not exist, CLI11 throws an exception for us.

@mbhall88
Copy link
Member Author

mbhall88 commented Sep 1, 2020

Map

Old

$ pandora map -h
only provided 2arguments
Usage: pandora map -p PRG_FILE -r READ_FILE <option(s)>
Options:
        -h,--help                       Show this help message
        -p,--prg_file PRG_FILE          Specify a fasta-style prg file
        -r,--read_file READ_FILE        Specify a file of reads in fasta format
        -o,--outdir OUTDIR      Specify directory of output
        -w W                            Window size for (w,k)-minimizers, must be <=k, default 14
        -k K                            K-mer size for (w,k)-minimizers, default 15
        -m,--max_diff INT               Maximum distance between consecutive hits within a cluster, default 500 (bps)
        -e,--error_rate FLOAT           Estimated error rate for reads, default 0.11
        -t T                            Number of threads, default 1
        --genome_size   NUM_BP  Estimated length of genome, used for coverage estimation
        --output_kg                     Save kmer graphs with fwd and rev coverage annotations for found localPRGs
        --output_vcf                    Save a vcf file for each found localPRG
        --vcf_refs REF_FASTA            A fasta file with an entry for each LocalPRG giving reference sequence for
                                        VCF. Must have a perfect match in the graph and the same name as the graph
        --output_comparison_paths       Save a fasta file for a random selection of paths through localPRG
        --output_covgs  Save a file of covgs for each localPRG present, one number per base of fasta file
        --output_mapped_read_fa Save a file for each gene containing read parts which overlapped it
        --illumina                      Data is from illumina rather than nanopore, so is shorter with low error rate
        --clean                 Add a step to clean and detangle the pangraph
        --bin                   Use binomial model for kmer coverages, default is negative binomial
        --max_covg                      Maximum average coverage from reads to accept
        --genotype MODE                 Add extra step to carefully genotype sites. Has two modes: global (ML path oriented) or local (coverage oriented)
        --snps_only                     When genotyping, include only snp sites
        -d,--discover                   Add denovo discovery
        --denovo_kmer_size                      Kmer size to use for denovo discovery
        --log_level                     debug,[info],warning,error

New

$ pandora map -h
Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample, and (optionally) genotype/discover variants.
Usage: ./pandora map [OPTIONS] <TARGET> <QUERY>

Positionals:
  <TARGET> FILE [required]    An indexed PRG file (in fasta format)
  <QUERY> FILE [required]     Fast{a,q} file containing reads to quasi-map

Options:
  -h,--help                   Print this help message and exit
  -v                          Verbosity of logging. Repeat for increased verbosity

Indexing:
  -w INT                      Window size for (w,k)-minimizers (must be <=k) [default: 14]
  -k INT                      K-mer size for (w,k)-minimizers [default: 15]

Input/Output:
  -o,--outdir DIR             Directory to write output files to [default: pandora]
  -t,--threads INT            Maximum number of threads to use [default: 1]
  --vcf-refs FILE             Fasta file with a reference sequence to use for each loci. The sequence MUST have a perfect match in <TARGET> and the same name
  --kg                        Save kmer graphs with forward and reverse coverage annotations for found loci
  --loci-vcf                  Save a VCF file for each found loci
  -C,--comparison-paths       Save a fasta file for a random selection of paths through loci
  --coverages                 Save a file of coverages for each loci present - one number per base
  -M,--mapped-reads           Save a fasta file for each loci containing read parts which overlapped it

Parameter Estimation:
  -e,--error-rate FLOAT       Estimated error rate for reads [default: 0.11]
  -g,--genome-size INT        Estimated length of the genome - used for coverage estimation [default: 5000000]
  --bin                       Use binomial model for kmer coverages [default: negative binomial]

Mapping:
  -m,--max-diff INT           Maximum distance (bp) between consecutive hits within a cluster [default: 250]
  -c,--min-cluster-size INT   Minimum size of a cluster of hits between a read and a loci to consider the loci present [default: 10]

Preset:
  -I,--illumina               Reads are from Illumina. Alters error rate used and adjusts for shorter reads

Filtering:
  --clean                     Add a step to clean and detangle the pangraph
  --max-covg INT              Maximum average coverage of reads to accept [default: 300]

Consensus/Variant Calling:
  --genotype                  Add extra step to carefully genotype sites.
  --snps                      When genotyping, only include SNP sites
  -d,--discover               Add a step to discover de novo variants
  --discover-k INT            Kmer size to use when disovering de novo variants [default: 11]
  --kmer-avg INT              Maximum number of kmers to average over when selecting the maximum likelihood path [default: 100]

Genotyping:
  --local Needs: --genotype   (Intended for developers) Use coverage-oriented (local) genotyping instead of the default ML path-oriented (global) approach.
  -a INT                      Hard threshold for the minimum allele coverage allowed when genotyping [default: 0]
  -s INT                      The minimum required total coverage for a site when genotyping [default: 0]
  -D INT                      Minimum difference in coverage on a site required between the first and second maximum likelihood path [default: 0]
  -F INT                      Minimum allele coverage, as a fraction of the expected coverage, allowed when genotyping [default: 0]
  -E,--gt-error-rate FLOAT    When genotyping, assume that coverage on alternative alleles arises as a result of an error process with rate -E. [default: 0.01]
  -G,--gt-conf INT            Minimum genotype confidence (GT_CONF) required to make a call [default: 1]

There is a lot to address here. In particular, I have added "Should this be exposed?" to some options. If we do want these options exposed, then a description is needed for them. Additionally, I would like to shorten a lot of these options as some are far too verbose. I have left two options without a description - --confidence-threshold and --genotyping-error-rate as I struggled to determine what these do from the code.

One outstanding thing is to create groups within the menu (similar to tools like minimap2 and bwa) so we would have something like "Input/Output Options", "Mapping", "Preset" etc. Please feel free to suggest groups. I will wait until we have settled on what options should be removed (if any).

Two breaking changes of importance to note:

  • removing the snakecase convention in options in favour of the more widely used "kebab case". i.e. --confidence_threshold becomes --confidence-threshold. If this is a big issue I will change it back, but I really dislike snake case in CLI opts.
  • the PRG and reads files are now positional arguments <TARGET> and <QUERY> respectively.

This change to CLI11 for map also closes #140

Lastly, I have added 'todo' comments above any options that I have questions or concerns about (map_main.cpp function setup_map_subcommand) so could we please discuss those in the code review? In particular, there is a comment on line 123 relating to a discussion @iqbal-lab and I had about whether we just move purely to global genotyping.

@mbhall88
Copy link
Member Author

mbhall88 commented Sep 1, 2020

Main

Old

$ pandora --help
[main] unrecognized command '--help'

New

$ pandora --help
Pandora: Pan-genome inference and genotyping with long noisy or short accurate reads.
Usage: pandora [OPTIONS] SUBCOMMAND

Options:
  -h,--help                   Print this help message and exit

Subcommands:
  index                       Index population reference graph (PRG) sequences.
  map                         Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample, and (optionally) genotype/discover variants.
  compare                     Quasi-map reads from multiple samples to an indexed PRG, infer the sequence of present loci in each sample, and call variants between the samples.
  walk                        Outputs a path through the nodes in a PRG corresponding to the either an input sequence (if it exists) or the top/bottom path
  seq2path                    For each sequence, return the path through the PRG
  get_vcf_ref                 Outputs a fasta suitable for use as the VCF reference using input sequences
  random                      Outputs a fasta of random paths through the PRGs
  merge_index                 Allows multiple indicies to be merged (no compatibility check)

Note: I will update this comment as I add more subcommands.
Closes #159

@mbhall88
Copy link
Member Author

mbhall88 commented Sep 1, 2020

Compare

Old

$ pandora compare -h
Usage: pandora compare -p PRG_FILE -r READ_INDEX -o OUTDIR <option(s)>
Options:
        -h,--help                       Show this help message
        -p,--prg_file PRG_FILE          Specify a fasta-style prg file
        -r,--read_index READ_INDEX      Specify a file with a line per sample
                                        sample_id <tab> filepath to reads in fasta/q format
        -o,--outdir OUTDIR      Specify directory of output
        -w W                            Window size for (w,k)-minimizers, default 14
        -k K                            K-mer size for (w,k)-minimizers, default 15
        -m,--max_diff INT               Maximum distance between consecutive hits within a cluster, default 250 (bps)
        -e,--error_rate FLOAT           Estimated error rate for reads, default 0.11
        -t T                            Number of threads, default 1
        --genome_size   NUM_BP  Estimated length of genome, used for coverage estimation
        --vcf_refs REF_FASTA            A fasta file with an entry for each LocalPRG giving reference sequence for
                                        VCF. Must have a perfect match in the graph and the same name as the graph
        --illumina                      Data is from illumina rather than nanopore, so is shorter with low error rate
        --clean                 Add a step to clean and detangle the pangraph
        --bin                   Use binomial model for kmer coverages, default is negative binomial
        --max_covg                      Maximum average coverage from reads to accept
        --genotype MODE                 Add extra step to carefully genotype sites. Has two modes: global (ML path oriented) or local (coverage oriented)
        --log_level                     debug,[info],warning,error

New

$ pandora compare -h
Quasi-map reads from multiple samples to an indexed PRG, infer the sequence of present loci in each sample, and call variants between the samples.
Usage: ./pandora compare [OPTIONS] <TARGET> <QUERY_IDX>

Positionals:
  <TARGET> FILE [required]    An indexed PRG file (in fasta format)
  <QUERY_IDX> FILE [required] A tab-delimited file where each line is a sample identifier followed by the path to the fast{a,q} of reads for that sample

Options:
  -h,--help                   Print this help message and exit
  -v                          Verbosity of logging. Repeat for increased verbosity

Indexing:
  -w INT                      Window size for (w,k)-minimizers (must be <=k) [default: 14]
  -k INT                      K-mer size for (w,k)-minimizers [default: 15]

Input/Output:
  -o,--outdir DIR             Directory to write output files to [default: pandora]
  -t,--threads INT            Maximum number of threads to use [default: 1]
  --vcf-refs FILE             Fasta file with a reference sequence to use for each loci. The sequence MUST have a perfect match in <TARGET> and the same name
  --loci-vcf                  Save a VCF file for each found loci

Parameter Estimation:
  -e,--error-rate FLOAT       Estimated error rate for reads [default: 0.11]
  -g,--genome-size INT        Estimated length of the genome - used for coverage estimation [default: 5000000]
  --bin                       Use binomial model for kmer coverages [default: negative binomial]

Mapping:
  -m,--max-diff INT           Maximum distance (bp) between consecutive hits within a cluster [default: 250]
  -c,--min-cluster-size INT   Minimum size of a cluster of hits between a read and a loci to consider the loci present [default: 10]

Preset:
  -I,--illumina               Reads are from Illumina. Alters error rate used and adjusts for shorter reads

Filtering:
  --clean                     Add a step to clean and detangle the pangraph
  --max-covg INT              Maximum average coverage of reads to accept [default: 300]

Consensus/Variant Calling:
  --genotype                  Add extra step to carefully genotype sites.
  --kmer-avg INT              Maximum number of kmers to average over when selecting the maximum likelihood path [default: 100]

Genotyping:
  --local Needs: --genotype   (Intended for developers) Use coverage-oriented (local) genotyping instead of the default ML path-oriented (global) approach.
  -a INT                      Hard threshold for the minimum allele coverage allowed when genotyping [default: 0]
  -s INT                      The minimum required total coverage for a site when genotyping [default: 0]
  -D INT                      Minimum difference in coverage on a site required between the first and second maximum likelihood path [default: 0]
  -F INT                      Minimum allele coverage, as a fraction of the expected coverage, allowed when genotyping [default: 0]
  -E,--gt-error-rate FLOAT    When genotyping, assume that coverage on alternative alleles arises as a result of an error process with rate -E. [default: 0.01]
  -G,--gt-conf INT            Minimum genotype confidence (GT_CONF) required to make a call [default: 1]

The changes here are similar to map. However, I noticed that whilst there are a lot of options in common, some aren't exposed in compare but are exposed in map (e.g. --min-kmer-covg)

@leoisl
Copy link
Collaborator

leoisl commented Sep 1, 2020

Thanks a lot for this, looks really nice! Thanks for taking the time to improve the CLI!

I have some comments below, but I realise that I am asking some stuff that might take considerable development time, so feel free to ignore everything. Also what you are doing already improve the CLI by a lot, so I think it is already enough. What I am proposing would be towards cleaning more the CLI and eventually breaking commands into more subcommands... but I don't think working a lot on CLI is a high priority task in our TODO list. I would be happy to review and merge your changes, and consider that CLI work is done.

Regarding pandora index:

--offset INT Offset for PRG ids [default: 0]

I feel like this parameter was sth required so that we could split the PRG into several subfiles, and launch a pandora index job for each subfile, and then combine the index later. I guess this is not required anymore, as pandora index takes a couple of hours? There are still also performance improvements we can do in pandora index if we think it is not quick enough.

Regarding pandora map/compare:

-w INT Window size for (w,k)-minimizers (must be <=k) [default: 14]
-k INT K-mer size for (w,k)-minimizers [default: 15]

This is actually not required, given that the w and k used to build the index is in the index filename. I actually think it is error prone to ask for w and k in pandora map/compare. User might set a specific value for w and k when running pandora index and do not put this value for pandora map/compare. Anyway, it is an information that is better obtained by parsing the index filename instead of asking for it.
PS: I am not sure this belongs to this issue. It might be better to create an issue for this;
PS2: this is not major, so not high priority. I actually would prefer to have a metadata file along with the index, instead of parsing metadata from filenames.

Regarding subcommands breakdown:

I think pandora should produce unfiltered VCFs. Then we should have a pandora view or pandora filter command to filter VCFs accordingly. Then some parameters might shift into this command... But I am not sure how many parameters we can move, as many of them affects the ML path, so it might not be possible...

There are some more subcommands breakdown, but I don't think will be productive to discuss them now...

@mbhall88
Copy link
Member Author

mbhall88 commented Sep 1, 2020

Walk

Old

$ pandora walk -h
Usage: pandora walk <in_prg.fa> [<seq.fa> | --top | --bottom]

New

$ pandora walk -h
Outputs a path through the nodes in a PRG corresponding to the either an input sequence (if it exists) or the top/bottom path
Usage: ./pandora walk [OPTIONS] <PRG>

Positionals:
  <PRG> FILE [required]       A PRG file (in fasta format)

Options:
  -h,--help                   Print this help message and exit
  -i,--input FILE Excludes: --top --bottom
                              Fast{a,q} of sequences to output paths through the PRG for
  -T,--top Excludes: --input --bottom
                              Output the top path through each local PRG
  -B,--bottom Excludes: --input --top
                              Output the bottom path through each local PRG

@iqbal-lab
Copy link
Collaborator

re this

This is actually not required, given that the w and k used to build the index is in the index filename. I actually think it is error prone to ask for w and k in pandora map/compare. User might set a specific value for w and k when running pandora index and do not put this value for pandora map/compare. Anyway, it is an information that is better obtained by parsing the index filename instead of asking for it.

Depending on the filename is better , but also error prone as people fiddle with filenames - inside a header would be safest. happy to postpone this though

@mbhall88
Copy link
Member Author

mbhall88 commented Sep 1, 2020

Re: removing --offset from index. I will do this as it is a simple change.

Re: removing k and w I am a little torn about this. I don't like the idea of getting it from the filename, as the user has the option to name the index something else with --outfile. As Zam mentions, we could somehow encode it in the index header? Although, as an example, minimap2 leaves it to the user to ensure they use the same present, -x, for both indexing and mapping. So it wouldn't be unusual to maintain the current functionality. If we do want to alter this behaviour, then it should be a separate issue. This PR is just about switching the CLI implementation. There are other issues closely linked that we can then address afterwards, such as this, and consistent use of filesystem paths [#68]

Speaking of the index --outfile parameter, I think this option is actually useless? Regardless of what it is named I think pandora always looks for an expected name? There are two load functions, but this is always called.

@leoisl
Copy link
Collaborator

leoisl commented Sep 2, 2020

Speaking of the index --outfile parameter, I think this option is actually useless? Regardless of what it is named I think pandora always looks for an expected name? There are two load functions, but this is always called.

Yeah, I think --outfile is not very useful. I think https://github.com/mbhall88/pandora/blob/e497912ab645f58daf1da84cfdab6bed2983b591/include/index.h#L37 is a helper function, and should be made private, whereas https://github.com/mbhall88/pandora/blob/e497912ab645f58daf1da84cfdab6bed2983b591/include/index.h#L35 is in the public interface.

There is a refactoring of the code that I have been postponing since the beginning, which is applying encapsulation to the code (i.e. hiding all attributes, providing access to them through getters/setters, and hide helper methods, just exposing the minimal public interface to classes clients). It will make things a lot easier to change and maintain later (i.e. many parts of the code access attributes directly, and thus when changing, for e.g. a std::map to an IntervalTree data structure as we did at some point to speed things up, we actually had to change a lot of the code because of this direct access. Encapsulation would make this a lot easier).
I am still unsure about doing this refactoring. I think refactoring code is great and should be a constant effort, but there absolutely no research progress, it is basically code maintenance, and refactoring can be quite lengthy and require quite a big effort. This might be better discussed in a meeting after the paper on how much time to devote for code maintenance, and adding features to pandora.

@mbhall88
Copy link
Member Author

mbhall88 commented Sep 2, 2020

There is a refactoring of the code that I have been postponing since the beginning, which is applying encapsulation to the code (i.e. hiding all attributes, providing access to them through getters/setters, and hide helper methods, just exposing the minimal public interface to classes clients). It will make things a lot easier to change and maintain later (i.e. many parts of the code access attributes directly, and thus when changing, for e.g. a std::map to an IntervalTree data structure as we did at some point to speed things up, we actually had to change a lot of the code because of this direct access. Encapsulation would make this a lot easier).
I am still unsure about doing this refactoring. I think refactoring code is great and should be a constant effort, but there absolutely no research progress, it is basically code maintenance, and refactoring can be quite lengthy and require quite a big effort. This might be better discussed in a meeting after the paper on how much time to devote for code maintenance, and adding features to pandora.

https://www.artima.com/intv/fixit.html

@mbhall88
Copy link
Member Author

mbhall88 commented Sep 2, 2020

Group names are added to #224 (comment). Let me know if anything needs changing

@mbhall88
Copy link
Member Author

mbhall88 commented Sep 5, 2020

check_kmergraph seq2path

Old

Usage: pandora check_kmergraph <prg.fa> <seq.fa> <k> <w> <flag>

New

$ pandora seq2path -h
For each sequence, return the path through the PRG
Usage: pandora seq2path [OPTIONS] <PRG>

Positionals:
  <PRG> FILE [required]       PRG to index (in fasta format)

Options:
  -h,--help                   Print this help message and exit
  -i,--input FILE Excludes: --top --bottom
                              Fast{a,q} of sequences to output paths through the PRG for
  -w INT                      Window size for (w,k)-minimizers (must be <=k) [default: 14]
  -k INT                      K-mer size for (w,k)-minimizers [default: 15]
  -T,--top Excludes: --input --bottom
                              Output the top path through each local PRG
  -B,--bottom Excludes: --input --top
                              Output the bottom path through each local PRG
  --flag Needs: --input       output success/fail rather than the node path
  -v                          Verbosity of logging. Repeat for increased verbosity

@mbhall88
Copy link
Member Author

mbhall88 commented Sep 5, 2020

get_vcf_ref

Old

$ pandora get_vcf_ref -h
[2020-09-05 07:05:30.279251] [0x00007f0a41346f80] [debug]   Loading PRGs from file -h
terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
SIGABRT: abort

followed by a whole bunch of singularity crazy syscall error messages 😬

New

$ pandora get_vcf_ref -h
Outputs a fasta suitable for use as the VCF reference using input sequences
Usage: ./pandora get_vcf_ref [OPTIONS] <PRG> [<QUERY>]

Positionals:
  <PRG> FILE [required]       PRG to index (in fasta format)
  <QUERY> FILE                Fast{a,q} file of sequences for some reason???

Options:
  -h,--help                   Print this help message and exit
  -z,--compress               Compress the output with gzip
  -v                          Verbosity of logging. Repeat for increased verbosity

@mbhall88
Copy link
Member Author

mbhall88 commented Sep 5, 2020

random_path random

I would also like to suggest renaming this subcommand to just be random

Old

$ pandora random_path -h
[2020-09-05 07:33:11.837477] [0x00007f34f049ff80] [debug]   Loading PRGs from file -h
terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
SIGABRT: abort
PC=0x47ce4b m=0 sigcode=0

followed by a whole bunch of error log

New

$ pandora random -h
Outputs a fasta of random paths through the PRGs
Usage: ./pandora random [OPTIONS] <PRG>

Positionals:
  <PRG> FILE [required]       PRG to generate random paths from

Options:
  -h,--help                   Print this help message and exit
  -n INT                      Number of paths to output [default: 1]
  -z,--compress               Compress the output with gzip
  -v                          Verbosity of logging. Repeat for increased verbosity

@iqbal-lab
Copy link
Collaborator

Might it be clearer to rename "check_kmergraph" to - "seq_to_path", to convert a sequence to a path in the graph?

@mbhall88
Copy link
Member Author

mbhall88 commented Sep 6, 2020

Might it be clearer to rename "check_kmergraph" to - "seq_to_path", to convert a sequence to a path in the graph?

Sure.

@mbhall88
Copy link
Member Author

mbhall88 commented Sep 6, 2020

merge_index

Old

$ pandora merge_index -h
Usage: pandora merge_index <index1> <index2> ...
Options:
        --outfile                               Filename for merged index

New

$ pandora merge_index -h
Allows multiple indicies to be merged (no compatibility check)
Usage: ./pandora merge_index [OPTIONS] <IDX>...

Positionals:
  <IDX> FILES [required]      Indicies to merge

Options:
  -h,--help                   Print this help message and exit
  -o,--outfile FILE           Filename for merged index [default: "merged_index.idx"]
  -v                          Verbosity of logging. Repeat for increased verbosity

@mbhall88 mbhall88 marked this pull request as ready for review September 6, 2020 20:08
@mbhall88
Copy link
Member Author

mbhall88 commented Sep 6, 2020

Ok. That is all the architecture setup for the new CLI. All subcommands are now implemented. Reviewing this is a big job so feel free to do it in chunks. I think the best way is to leave comments in the code review rather than in the conversation here.
I have left a whole bunch of // todo: comments scattered throughout in parts where we need a discussion.

@mbhall88
Copy link
Member Author

I have updated some descriptions and shortened some options.
I am about to start working on a PR to merge candidate regions [#228] so I would like to merge this CLI PR as soon as possible to avoid problems with the other PR comes in (i.e. conflicts with this CLI PR as it will be based off this work)

@leoisl
Copy link
Collaborator

leoisl commented Sep 21, 2020

I can try to review it tomorrow, but I am overloaded... might just be able to do it on Wednesday. I think it is better to merge, and then I review at some point where I don't have to rush the review. If there is any issue, I will then make a second CLI PR, but just from an overview of what you did, I think it should be all fine. I think my reviewing will be limited also to just look at the code, but we might not be able to capture possible bugs (if there are any), as most of the code you changed (files implementing main()) have no test coverage

@leoisl
Copy link
Collaborator

leoisl commented Sep 21, 2020

Update: I will review this PR this evening. I thought about speeding up the pandora paper pipelines and put all the 3 remaining pipelines to run simultaneously, but one thing that I learned is that pushing the cluster might actually slow things down (LSF becomes overloaded, bjobs starts not keeping logs of jobs for very long (disk access to check status slow things a lot), etc). Better to run things in a slower, yet more reliable way... so will check this tonight!

@leoisl
Copy link
Collaborator

leoisl commented Sep 21, 2020

Hello again, sorry, will just be able review tomorrow :(

Copy link
Collaborator

@leoisl leoisl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All fine, just some very minor stuff! You can choose if implement them or not

src/get_vcf_ref_main.cpp Outdated Show resolved Hide resolved
src/compare_main.cpp Outdated Show resolved Hide resolved
src/compare_main.cpp Show resolved Hide resolved
src/compare_main.cpp Show resolved Hide resolved
src/compare_main.cpp Outdated Show resolved Hide resolved
src/merge_index_main.cpp Outdated Show resolved Hide resolved
src/walk_main.cpp Outdated Show resolved Hide resolved
@leoisl
Copy link
Collaborator

leoisl commented Sep 23, 2020

As a last task for this PR to be merged, should we try to run it in a real pipeline, e.g. in your TB pipeline? I think things are fine, but as there are no code coverage on most of these code changes, I'd like to ensure that at least our normal workflow is running correctly. This task could also be pushed to later, as these changes will eventually be tested in a normal workflow...

@mbhall88
Copy link
Member Author

As a last task for this PR to be merged, should we try to run it in a real pipeline, e.g. in your TB pipeline? I think things are fine, but as there are no code coverage on most of these code changes, I'd like to ensure that at least our normal workflow is running correctly. This task could also be pushed to later, as these changes will eventually be tested in a normal workflow...

Yeah, I've already been doing this.

@mbhall88 mbhall88 merged commit a5baaf1 into iqbal-lab-org:dev Sep 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants