Merge pull request #268 from rmcolq/local_updates_with_unit_tests
Local updates with unit tests
leoisl authored Mar 31, 2021
2 parents 1f1c6d4 + 9cbbcfd commit 3aceff1
Showing 46 changed files with 1,458 additions and 487 deletions.
18 changes: 15 additions & 3 deletions .gitignore
Expand Up @@ -105,9 +105,9 @@ example/prgs/kmer_prgs/
example/prgs/toy_prg.fa.k15.w14.idx
example/pandora-latest.simg
example/pandora_workflow
!example/msas/custom/GC00006032.fa
!example/msas/custom/GC00010897.fa
!example/pandora_workflow_data/assemblies/samples/toy_sample_1/toy_sample_1.ref.fa
!example/msas/GC00006032.fa
!example/msas/GC00010897.fa
!example/pandorupdated_prgsa_workflow_data/assemblies/samples/toy_sample_1/toy_sample_1.ref.fa
!example/pandora_workflow_data/assemblies/samples/toy_sample_2/toy_sample_2.ref.fa
!example/prgs/toy_prg.fa
!example/scripts/data/ref_to_get_reads_from.toy_example_1.fa
Expand All @@ -118,3 +118,15 @@ build_portable_executable
pandora-linux-precompiled

/cmake-build-release/
example/pandora_discover
example/venv
/example/msas_output/
/example/make_prg_0.2.0_prototype
/example/pandora-linux-precompiled-v0.9.0
/example/pandora_discover_out/
/example/prgs/
/example/output_toy_example_with_denovo/
/example/updated_prgs/
/example/pandora-linux-precompiled-v0.9.0.gz

example/pandora-linux-precompiled-pandora_paper_tag1
3 changes: 3 additions & 0 deletions .gitmodules
Expand Up @@ -4,3 +4,6 @@
[submodule "thirdparty/cgranges"]
path = thirdparty/cgranges
url = https://github.com/lh3/cgranges
[submodule "thirdparty/seqan"]
path = thirdparty/seqan
url = https://github.com/seqan/seqan.git
14 changes: 14 additions & 0 deletions CHANGELOG.md
Expand Up @@ -9,6 +9,20 @@ project adheres to

## [Unreleased]

## [0.9.0]

### Changed
- `pandora discover` now receives read index files describing samples and reads, and discovers denovo sequences in these samples.
To improve performance on discovering denovo sequences on several samples, `pandora discover` is now multithreaded, but
the performance is still the same as the previous version, i.e. each sample is processed in a single-threaded way;
- `pandora discover` output changed to a proprietary format. See [example](example) for the new output;
- `pandora` can now communicate with a [`make_prg` prototype](https://github.com/leoisl/make_prg) that is able to update PRGs
without needing to realign and remake the PRG. This provides major performance upgrades to running the full `pandora` pipeline
with denovo discovery enabled, and there is no need anymore to use a `snakemake` pipeline
(see [this example](example/run_pandora.sh) for how to run the full pipeline);
- We now use [musl libc](https://musl.libc.org/) instead of [Holy Build Box](https://github.com/phusion/holy-build-box)
to build a precompiled portable binary, removing the dependency on `OpenMP 4.0+` or `GCC 4.9+`, and `GLIBC`;

## [0.8.0]

### Added
Expand Down
14 changes: 13 additions & 1 deletion CMakeLists.txt
Expand Up @@ -11,7 +11,7 @@ HunterGate(

# project configuration
set(PROJECT_NAME_STR pandora)
project(${PROJECT_NAME_STR} VERSION "0.8.0" LANGUAGES C CXX)
project(${PROJECT_NAME_STR} VERSION "0.9.0" LANGUAGES C CXX)
configure_file( include/version.h.in ${CMAKE_BINARY_DIR}/include/version.h )

# add or not feature to print the stack trace
Expand Down Expand Up @@ -104,10 +104,18 @@ set(Boost_USE_STATIC_LIBS ON)
########################################################################################################################
# PANDORA INSTALLATION
########################################################################################################################
# allows Seqan to be found
list(APPEND CMAKE_PREFIX_PATH "${PROJECT_SOURCE_DIR}/thirdparty/seqan/util/cmake")
set(SEQAN_INCLUDE_PATH "${PROJECT_SOURCE_DIR}/thirdparty/seqan/include")

# Load the SeqAn module and fail if not found
find_package (SeqAn REQUIRED)

#include directories as SYSTEM includes, thus warnings will be ignored for these
include_directories(SYSTEM
${CMAKE_BINARY_DIR}/include
${PROJECT_SOURCE_DIR}/thirdparty/cgranges/cpp
${SEQAN_INCLUDE_DIRS}
)

# normal includes: warnings will be reported for these
Expand All @@ -118,6 +126,9 @@ include_directories(
${PROJECT_SOURCE_DIR}/thirdparty/src
)

# Add definitions set by find_package (SeqAn).
add_definitions (${SEQAN_DEFINITIONS})

file(GLOB_RECURSE SRC_FILES
${PROJECT_SOURCE_DIR}/src/*.cpp
${PROJECT_SOURCE_DIR}/src/*/*.cpp
Expand All @@ -141,6 +152,7 @@ target_link_libraries(${PROJECT_NAME}
${CMAKE_DL_LIBS}
${STATIC_C_CXX}
${BACKWARD_LIBRARIES}
${SEQAN_LIBRARIES}
)

enable_testing()
Expand Down
37 changes: 10 additions & 27 deletions README.md
Expand Up @@ -16,6 +16,7 @@
- [Quick Start](#quick-start)
- [Hands-on toy example](#hands-on-toy-example)
- [Installation](#installation)
- [Precompiled portable binary](#no-installation-needed---precompiled-portable-binary)
- [Containers](#containers)
- [Installation from source](#installation-from-source)
- [Usage](#usage)
Expand All @@ -27,7 +28,7 @@ Pandora is a tool for bacterial genome analysis using a pangenome reference grap
The PanRG is a collection of 'floating'
local graphs (PRGs), each representing some orthologous region of interest
(e.g. genes, mobile elements or intergenic regions). See
https://github.com/rmcolq/make_prg for a pipeline which can construct
https://github.com/leoisl/make_prg for a tool which can construct
these PanRGs from a set of aligned sequence files.

Pandora can do the following for a single sample (read dataset):
Expand Down Expand Up @@ -66,51 +67,35 @@ pandora map <panrg.fa> <reads.fq>
## Hands-on toy example

You can test `pandora` on a toy example following [this link](example).
There is no need to have `pandora` installed, as it is run inside containers.
**There is no need to have `pandora` installed.**

## Installation

### No installation needed - precompiled portable binary

You can use `pandora` with no installation at all by simply downloading the precompiled binary, and running it.
In this binary, all libraries are linked statically, except for OpenMP.

* **Requirements**
* The only dependency required to run the precompiled binary is OpenMP 4.0+;
* The easiest way to install OpenMP 4.0+ is to have GCC 4.9 (from April 22, 2014) or more recent installed, which supports OpenMP 4.0;
* Technical details on why OpenMP can't be linked statically
can be found [here](https://gcc.gnu.org/onlinedocs/gfortran/OpenMP.html).
In this binary, all libraries are linked statically.

* **Download**:
```
wget https://github.com/rmcolq/pandora/releases/download/0.8.0/pandora-linux-precompiled-v0.8.0
wget https://github.com/rmcolq/pandora/releases/download/0.9.0/pandora-linux-precompiled-v0.9.0
```

* **Running**:
```
chmod +x pandora-linux-precompiled-v0.8.0
./pandora-linux-precompiled-v0.8.0 -h
chmod +x pandora-linux-precompiled-v0.9.0
./pandora-linux-precompiled-v0.9.0 -h
```

* **Compatibility**: This precompiled binary works on pretty much any glibc-2.12-or-later-based x86 and x86-64 Linux distribution
released since approx 2011. A non-exhaustive list: Debian >= 7, Ubuntu >= 10.10, Red Hat Enterprise Linux >= 6,
CentOS >= 6;

* **Credits**:
* Precompilation is done using [Holy Build Box](http://phusion.github.io/holy-build-box/);
* We acknowledge Páll Melsted since we followed his [blog post](https://pmelsted.wordpress.com/2015/10/14/building-binaries-for-bioinformatics/) to build this portable binary.

* **Notes**:
* We provide precompiled binaries for Linux OS only;
* The performance of precompiled binaries is several times slower than a binary compiled from source.
The main reason is that the precompiled binary can't contain specific instructions that might speed up
the execution on specific processors, as it has to be runnable on a wide range of systems;

### Containers

![Docker Cloud Build Status](https://img.shields.io/docker/cloud/build/rmcolq/pandora)
[![Docker Repository on Quay](https://quay.io/repository/rmcolq/pandora/status "Docker Repository on Quay")](https://quay.io/repository/rmcolq/pandora)

You can also download a containerized image of Pandora.
Pandora is hosted on Dockerhub and images can be downloaded with the
Pandora is hosted on Quay and images can be downloaded with the
command:

```
Expand All @@ -123,8 +108,6 @@ Alternatively, using singularity:
singularity pull docker://quay.io/rmcolq/pandora
```

NB For consistency, we no longer maintain images on singularity hub.

### Installation from source

This is the hardest way to install `pandora`, but that yields the most optimised binary.
Expand Down
9 changes: 4 additions & 5 deletions doc/Usage.md
Expand Up @@ -39,8 +39,7 @@ graphs, one entry for each gene/ genome region of interest. If you
haven't, you will need a multiple sequence alignment for each graph.
Precompiled collections of MSA representing orthologous gene clusters for
a number of species can be downloaded from [here](http://pangenome.de/)
and converted to graphs using the pipeline from
[here](https://github.com/rmcolq/make_prg).
and converted to graphs using [make_prg](https://github.com/leoisl/make_prg).

# Build index

Expand Down Expand Up @@ -193,19 +192,19 @@ Genotyping:
-G,--gt-conf INT Minimum genotype confidence (GT_CONF) required to make a call [default: 1]
```

# Discover novel variants
# Discover novel variants in several samples

This will look for regions in the pangraph where the reads do not map
and attempt to locally assemble these regions to find novel variants.

```
$ pandora discover --help
Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample and discover novel variants.
Usage: pandora discover [OPTIONS] <TARGET> <QUERY>
Usage: pandora discover [OPTIONS] <TARGET> <QUERY_IDX>
Positionals:
<TARGET> FILE [required] An indexed PRG file (in fasta format)
<QUERY> FILE [required] Fast{a,q} file containing reads to quasi-map
<QUERY_IDX> FILE [required] A tab-delimited file where each line is a sample identifier followed by the path to the fast{a,q} of reads for that sample
Options:
-h,--help Print this help message and exit
Expand Down
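
The new `<QUERY_IDX>` positional in the hunk above is described as a tab-delimited file with one sample identifier and one reads path per line. A minimal sketch of producing such a file is shown here, assuming Python is available; the helper name `write_read_index` and the sample names and read paths are hypothetical and chosen only for illustration, they are not taken from the repository.

```python
import io


def write_read_index(samples, handle):
    """Write one `sample_id<TAB>reads_path` line per sample to `handle`."""
    for sample_id, reads_path in samples:
        # A tab or newline inside a field would corrupt the two-column layout.
        for field in (sample_id, reads_path):
            if "\t" in field or "\n" in field:
                raise ValueError(f"field contains tab or newline: {field!r}")
        handle.write(f"{sample_id}\t{reads_path}\n")


# Hypothetical sample identifiers and read paths, for illustration only.
samples = [
    ("toy_sample_1", "reads/toy_sample_1.fq"),
    ("toy_sample_2", "reads/toy_sample_2.fq"),
]

buffer = io.StringIO()
write_read_index(samples, buffer)
read_index_text = buffer.getvalue()
print(read_index_text, end="")
```

Writing to an open file instead of the in-memory buffer gives a file that could then be passed as `<QUERY_IDX>`, provided the paths point at real fast{a,q} files.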
82 changes: 46 additions & 36 deletions example/README.md
@@ -1,68 +1,78 @@
# Toy example

Here we present a walkthrough of running `pandora` on a toy example. We run: 1) `pandora` without de novo discovery; 2) [`pandora` workflow](https://github.com/iqbal-lab-org/pandora_workflow), which runs `pandora` with and without de novo discovery (see Figure 2 of [our paper][pandora_2020_paper]). Although method 2) runs both modes of `pandora`, it is much more involved than method 1), as the user needs to configure and run a `snakemake` pipeline. If no de novo discovery is required, method 1 is a lot simpler to run (two commands) as opposed to running the pipeline. For completeness, we show both methods.
Here we present a walkthrough of running `pandora` on a toy example. We run:
1) `pandora` without de novo discovery;
2) `pandora` with de novo discovery (see Figure 2 of [our paper][pandora_2020_paper]).

## Input data description
## Dependencies
* **There is no need to have `pandora` or `make_prg` installed. The running script will automatically download
and run the precompiled binaries**;
* `MAFFT` has to be in your `PATH` in order to run `make_prg update`. It can be installed:
1. from source: https://mafft.cbrc.jp/alignment/software/;
2. using conda: `conda install -c bioconda mafft`;
* `wget`;
* `make_prg` requirements: `GLIBC >= 2.17` (present on `Ubuntu >= 13.04`, `Debian >= 8.0`, `CentOS >= 7`, `RHEL >= 7.9`,
`Fedora >= 19`, etc);

```
msas/ : contains the MSAs of the two genes we are using as toy example here, GC00006032 and GC00010897;
prgs/toy_prg.fa : contains the PRGs of the two genes, GC00006032 and GC00010897. These are fairly simple PRGs. GC00006032 contains 4 variant sites, each representing a SNP, while GC00010897 contains 5;
reads/ : contains 100x of perfect simulated reads from two toy samples. We simulated perfect reads, where one sample genotypes to one allele of the variant sites, while the other sample genotypes towards the other allele;
pandora_workflow_data/ : contains other input and configuration files to run the pandora workflow;
```

## pandora without de novo discovery
## Input data description

### Dependencies
* `msas/` : contains the MSAs of the two genes we are using as toy example here, GC00006032 and GC00010897;
* `reads/` : contains 100x of perfect simulated reads from two toy samples. We simulated perfect reads, where one sample genotypes to one allele of the variant sites, while the other sample genotypes towards the other allele;

* `md5sum`, `wget`, `GCC` 4.9+ (see [why](../README.md#no-installation-needed---precompiled-portable-binary)).
## Running

### Running
```
cd example && ./run_pandora_nodenovo.sh
./run_pandora.sh
```

### Quick look at the output

`pandora` output will be located in directory `output_toy_example_no_denovo`.
`prgs`: contains output of `make_prg from_msa` and `pandora index`. Main files:
* `pangenome.prg.fa`: the PRG itself;
* `pangenome.prg.fa.k15.w14.idx` and `kmer_prgs`: the PRG index;
* `pangenome.update_DS`: update data structures that make the PRG updateable;

Taking a quick look at an excerpt of the genotyped VCF (`output_toy_example_no_denovo/pandora_multisample_genotyped.vcf`):
`pandora_discover_out`: contains the output of `pandora discover`. Main files:
* `denovo_paths.txt`: describes the denovo paths found in all samples;

```
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT toy_sample_1 toy_sample_2
GC00006032 146 . C T . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:41,0:52,0:41,0:52,0:83,0:105,0:0,1:-18.7786,-526.281:507.502 1:0,15:0,15:0,15:0,15:0,31:0,31:1,0:-214.155,-3.53065:210.624
GC00006032 160 . A C . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:14,29:17,42:0,36:0,51:42,148:52,212:0.666667,0.2:-366.08,-159.943:206.137 0:12,0:8,0:18,0:9,0:38,0:24,0:0.333333,1:-20.2506,-168.103:147.853
GC00006032 218 . T C . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:3,11:4,14:0,11:0,14:12,23:16,28:0.75,0:-182.162,-41.9443:140.217 0:11,0:5,0:13,0:6,0:44,0:21,0:0.25,1:-19.9705,-149.683:129.712
```
`updated_prgs`: contains the output of `make_prg update` and `pandora index` (on the updated PRG).
The files are similar to the ones in the `prgs` folder;

We can see samples `toy_sample_1` and `toy_sample_2` genotype towards different alleles.
`output_toy_example_no_denovo` and `output_toy_example_with_denovo`: contains the output of
`pandora compare` without denovo discovery and with denovo discovery, respectively. Main files:
* `pandora_multisample.matrix`: see https://github.com/rmcolq/pandora/wiki/FAQ#q-where-can-i-find-gene-presenceabsence-information ;
* `pandora_multisample.vcf_ref.fa`: see https://github.com/rmcolq/pandora/wiki/FAQ#q-what-are-the-sequences-in-pandora_multisamplevcf_reffa
* `pandora_multisample_genotyped.vcf`: the VCF file containing variants for all samples;

## pandora workflow

### Dependencies
### Looking at the genotyped VCFs

* [`singularity`](https://sylabs.io/), `git`, `python 3.6+`
**No denovo**

Taking a quick look at an excerpt of `output_toy_example_no_denovo/pandora_multisample_genotyped.vcf`
(the VCF genotyped by `pandora` without denovo sequences):

### Running
```
cd example && ./run_pandora_workflow.sh
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT toy_sample_1 toy_sample_2
GC00006032 146 . T C . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,41:0,52:0,41:0,52:0,83:0,105:1,0:-526.281,-18.7786:507.502 0:15,0:15,0:15,0:15,0:31,0:31,0:0,1:-3.53065,-214.155:210.624
GC00006032 160 . A C . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,26:0,40:0,33:0,50:0,106:0,160:1,0.25:-401.941,-17.9221:384.019 0:19,0:12,0:19,0:12,0:38,0:24,0:0,1:-3.32705,-218.76:215.433
GC00006032 218 . T C . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:3,11:4,14:0,11:0,14:12,23:16,28:0.75,0:-182.162,-41.9443:140.217 0:11,0:5,0:13,0:6,0:44,0:21,0:0.25,1:-19.9705,-149.683:129.712
```

### Quick look at the output

`pandora` workflow output will be located at dir `pandora_workflow/output_toy_example_workflow/illumina/100x/random/compare_(no|with)denovo_global_genotyping`.
We can see samples `toy_sample_1` and `toy_sample_2` genotype towards different alleles.

Files `pandora_workflow/output_toy_example_workflow/illumina/100x/random/compare_nodenovo_global_genotyping/pandora_multisample_genotyped_global.vcf` and `output_toy_example_no_denovo/pandora_multisample_genotyped.vcf` both represent running `pandora` without de novo discovery and should be very similar files, just differentiating on some header lines, and on some statistics, due to slightly different versions used. For the first file, it is the version used in the paper; for the second file, the version on the `master` branch, which is more recent, with some bugs fixed.
**With denovo**

File `pandora_workflow/output_toy_example_workflow/illumina/100x/random/compare_withdenovo_global_genotyping/pandora_multisample_genotyped_global.vcf` is the `pandora` VCF with de novo discovery, and it has some new VCF records that were discovered and genotyped. For example:
The VCF (`output_toy_example_with_denovo/pandora_multisample_genotyped.vcf`) has some new variants that were discovered and genotyped. For example:

```
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT toy_sample_1.100x.random.illumina toy_sample_2.100x.random.illumina
GC00006032 49 . G A . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,59:0,44:0,59:0,44:0,59:0,44:1,0:-570.333,-26.8805:543.452 0:50,0:48,0:50,0:48,0:100,0:97,0:0,1:-28.9415,-537.307:508.365
GC00010897 44 . C T . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,16:0,11:0,16:0,11:0,32:0,23:1,0:-220.34,-8.03511:212.304 0:18,0:22,0:18,0:22,0:37,0:44,0:0,1:-2.87264,-270.207:267.334
GC00010897 422 . A T . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,5:0,8:0,5:0,8:0,11:0,16:1,0:-155.867,-20.2266:135.641 0:9,0:12,0:9,0:12,0:9,0:12,0:0,1:-9.39494,-182.709:173.314
GC00006032 49 . A G . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:44,0:59,0:44,0:59,0:44,0:59,0:0,1:-26.8805,-570.333:543.452 1:0,48:0,50:0,48:0,50:0,97:0,100:1,0:-537.307,-28.9415:508.365
GC00010897 44 . C T . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,11:0,16:0,11:0,16:0,23:0,32:1,0:-220.34,-8.03511:212.304 0:22,0:18,0:22,0:18,0:44,0:37,0:0,1:-2.87264,-270.207:267.334
GC00010897 422 . A T . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,8:0,5:0,8:0,5:0,16:0,11:1,0:-155.867,-20.2266:135.641 0:12,0:9,0:12,0:9,0:12,0:9,0:0,1:-9.39494,-182.709:173.314
```
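
To check which allele each sample genotypes towards, the per-sample columns of the VCF records can be split using the `FORMAT` keys. The following is a minimal sketch in Python, not part of `pandora`; the function names are hypothetical, and the record is copied from the no-denovo excerpt shown earlier in this README. It splits on any whitespace, so it works on real tab-delimited VCF lines as well as on the space-flattened excerpt.

```python
def parse_sample_field(format_keys, sample_field):
    """Pair the colon-separated FORMAT keys with one sample's values."""
    keys = format_keys.split(":")
    values = sample_field.split(":")
    if len(keys) != len(values):
        raise ValueError("FORMAT and sample field have different lengths")
    return dict(zip(keys, values))


def genotype_calls(record_line, sample_names):
    """Return {sample: (GT, GT_CONF)} for one VCF data line."""
    columns = record_line.split()
    format_keys = columns[8]   # the FORMAT column
    calls = {}
    for name, field in zip(sample_names, columns[9:]):
        parsed = parse_sample_field(format_keys, field)
        calls[name] = (parsed["GT"], float(parsed["GT_CONF"]))
    return calls


record = (
    "GC00006032 146 . T C . . SVTYPE=SNP;GRAPHTYPE=SIMPLE "
    "GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:"
    "SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF "
    "1:0,41:0,52:0,41:0,52:0,83:0,105:1,0:-526.281,-18.7786:507.502 "
    "0:15,0:15,0:15,0:15,0:31,0:31,0:0,1:-3.53065,-214.155:210.624"
)

calls = genotype_calls(record, ["toy_sample_1", "toy_sample_2"])
print(calls)
```

For this record, `toy_sample_1` is called as allele `1` and `toy_sample_2` as allele `0`, matching the statement that the two samples genotype towards different alleles.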


<!--Link References-->

[pandora_2020_paper]: https://www.biorxiv.org/content/10.1101/2020.11.12.380378v2
File renamed without changes.
File renamed without changes.
1 change: 0 additions & 1 deletion example/pandora-linux-precompiled-v0.8.0-alpha.md5sum.txt

This file was deleted.

This file was deleted.
