Skip to content

Commit

Permalink
Merge pull request #179 from qiyunzhu/0.1.5
Browse files Browse the repository at this point in the history
For version 0.1.5
  • Loading branch information
qiyunzhu authored Dec 26, 2022
2 parents f7adceb + 87af16c commit 225b0a1
Show file tree
Hide file tree
Showing 16 changed files with 304 additions and 65 deletions.
83 changes: 49 additions & 34 deletions doc/cli.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,16 @@

The command-line interface (CLI) of Woltka provides several commands:

- [**classify**](#classify): Complete classification workflow with all parameters.
- [**tools**](#tools): Utilities for working with alignments, maps and profiles.
Main workflow:
- [**classify**](#classify): Complete classification workflow that analyzes sequence alignments based on a classification system and generate profiles.

Profile utilities:
- [**collapse**](#collapse): Collapse a profile by feature mapping and/or hierarchy.
- [**normalize**](#normalize): Normalize a profile to fractions and/or by feature sizes.
- [**filter**](#filter): Filter a profile by per-sample abundance.
- [**merge**](#merge): Merge multiple profiles into one profile.
- [**coverage**](#coverage): Calculate per-sample coverage of feature groups.


## Classify

Expand Down Expand Up @@ -93,35 +101,27 @@ Option | Description
`--no-exe` | Disable calling external programs (`gzip`, `bzip2` and `xz`) for decompression. Otherwise, Woltka will use them if available for faster processing, or switch back to Python if not.


## Tools

### Filter

Filter a profile by **per-sample** abundance.

* See [Per-sample filtering](filter.md) for details.

Option | Description
--- | ---
`--input`, `-i` (required) | Path to input alignment file or directory of alignment files.
`--output`, `-o` (required) | Path to output profile file or directory of profile files.
`--min-count`, `-c` | Per-sample minimum count threshold (>=1).
`--min-percent`, `-p` | Per-sample minimum percentage threshold (<100).
## Collapse

### Merge
Collapse a profile based on feature mapping (supports **many-to-many** mapping) and/or hierarchy.

Merge multiple profiles into one profile.

* See [Merging profiles](merge.md) for details.
* See [profile collapsing](collapse.md) for details.

Option | Description
--- | ---
`--input`, `-i` (required) | Path to input profiles or directories containing profiles. Can accept multiple paths.
`--input`, `-i` (required) | Path to input profile.
`--output`, `-o` (required) | Path to output profile.
`--map`, `-m` | Path to mapping of source features to target features.
`--divide`, `-d` | Count each target feature as 1 / _k_ (_k_ is the number of targets mapped to a source). Otherwise, count as one.
`--field`, `-f` | Features are stratified (strata delimited by "\|"). For example, if features are like "species\|gene", one can use `-f 1` to collapse "species" or `-f 2` to collapse "gene".
`--nested`, `-e` | Features are nested (each field is a child of the previous field). For example, "A_1" represents "1" of "A", and the entire feature is equivalent to stratified feature "A\|A_1". This parameter overrides the "\|"-delimited strata.
`--sep`, `-s` | Field separator for stratified features (default: "\|") or nested features (default: "_").
`--names`, `-n` | Path to mapping of target features to names. The names will be appended to the collapsed profile as a metadata column.

### Normalize

Normalize a profile to fractions or by feature sizes.
## Normalize

Normalize a profile to fractions and/or by feature sizes

* See [Profile normalization](normalize.md) for details.

Expand All @@ -133,24 +133,34 @@ Option | Description
`--scale`, `-s` | Scale values by this factor. Accepts "k", "M" suffixes.
`--digits`, `-d` | Round values to this number of digits after the decimal point. If omitted, will keep decimal precision of input profile.

### Collapse

Collapse a profile based on feature mapping (supports **many-to-many** mapping) (details).
## Filter

* See [profile collapsing](collapse.md) for details.
Filter a profile by **per-sample** abundance.

* See [Per-sample filtering](filter.md) for details.

Option | Description
--- | ---
`--input`, `-i` (required) | Path to input profile.
`--map`, `-m` (required) | Path to mapping of source features to target features.
`--input`, `-i` (required) | Path to input alignment file or directory of alignment files.
`--output`, `-o` (required) | Path to output profile file or directory of profile files.
`--min-count`, `-c` | Per-sample minimum count threshold (>=1).
`--min-percent`, `-p` | Per-sample minimum percentage threshold (<100).


## Merge

Merge multiple profiles into one profile.

* See [Merging profiles](merge.md) for details.

Option | Description
--- | ---
`--input`, `-i` (required) | Path to input profiles or directories containing profiles. Can accept multiple paths.
`--output`, `-o` (required) | Path to output profile.
`--divide`, `-d` | Count each target feature as 1 / _k_ (_k_ is the number of targets mapped to a source). Otherwise, count as one.
`--field`, `-f` | Features are stratified (strata delimited by "\|"). For example, if features are like "species\|gene", one can use `-f 1` to collapse "species" or `-f 2` to collapse "gene".
`--nested`, `-e` | Features are nested (each field is a child of the previous field). For example, "A_1" represents "1" of "A", and the entire feature is equivalent to stratified feature "A\|A_1". This parameter overrides the "\|"-delimited strata.
`--sep`, `-s` | Field separator for stratified features (default: "\|") or nested features (default: "_").
`--names`, `-n` | Path to mapping of target features to names. The names will be appended to the collapsed profile as a metadata column.

### Coverage

## Coverage

Calculate per-sample coverage of feature groups in a profile.

Expand All @@ -164,3 +174,8 @@ Option | Description
`--threshold`, `-t` | Convert coverage to presence (1) / absence (0) data by this percentage threshold.
`--count`, `-c` | Record numbers of covered features instead of percentages (overrides threshold).
`--names`, `-n` | Path to mapping of feature groups to names. The names will be appended to the coverage table as a metadata column.


## Tools

A sub menu containing all commands except for `classify`. It is for backward compatibility. It is deprecated and will be removed in the next release.
6 changes: 3 additions & 3 deletions doc/collapse.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,13 +5,13 @@ The **profile collapsing** function is a lightweight and flexible addition to th
Collapse features based on a mapping file:

```bash
woltka collapse -i input.biom -m mapping.txt -o output.biom
woltka collapse -i gene.biom -m ko.map -o ko.biom
```

Collapse nested features to the first level:
Collapse nested features to the second level:

```bash
woltka collapse -i input.biom -e -f 1 -o output.biom
woltka collapse -i lineage.tsv -e -f 2 -o phylum.tsv
```


Expand Down
2 changes: 1 addition & 1 deletion doc/coverage.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,7 @@ L-glutamine degradation I | 100.0 | 100.0 | 50.0 | 0.0
Sucrose biosynthesis I | 20.0 | 20.0 | 20.0 | 20.0
... |

**Note**: The "coverage" computed by Woltka is distinct from those by HUMAnN2 (whether the pathway is present) and HUMAnN3 (how likely the pathway is present).
**Note**: The "coverage" computed by Woltka is not the same as those by [HUMAnN2](https://huttenhower.sph.harvard.edu/humann2) (whether the pathway is present) and [HUMAnN3](https://huttenhower.sph.harvard.edu/humann) (how likely the pathway is present), although the usage and interpretation may be comparable.

## Parameters

Expand Down
31 changes: 23 additions & 8 deletions doc/faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ To date, all Woltka versions (0.1.0 to 0.1.5) generate **identical** output file

### How many CPU cores does Woltka use?

Woltka works the best with two CPU cores: one for file decompression and the other for classification. This happens automatically. See [here](perform.md#keep-external-decompressors-on) for details.
Woltka works the best with **two CPU cores**: one for file decompression and the other for classification. This happens automatically. See [here](perform.md#keep-external-decompressors-on) for details.

### Does Woltka support multiprocessing?

Expand All @@ -28,11 +28,11 @@ Not at the moment. But you can run multiple Woltka instances on different subset

### Does Woltka support gzipped alignment files?

Yes. All input files for Woltka (alignments and databases) can be supplied as compressed in gzip, bzip2 or xz formats. Woltka will automatically recognize and process them.
Yes. All input files for Woltka (alignments and databases) can be supplied as compressed in **gzip**, **bzip2** or **xz** formats. Woltka will automatically recognize and process them.

### Does Woltka support [BAM](https://en.wikipedia.org/wiki/Binary_Alignment_Map) and [CRAM](https://en.wikipedia.org/wiki/CRAM_(file_format)) formats?

Not out-of-the-box. But you can use SAMtools to extract BAM/CRAM files and directly "pipe" into Woltka, like this ("-" represents stdin):
Not out-of-the-box. But you can use SAMtools to extract BAM/CRAM files and directly "pipe" them into Woltka, like this ("-" represents stdin):

```bash
samtools view input.bam | woltka classify -i - -o output.biom
Expand Down Expand Up @@ -89,15 +89,30 @@ Woltka provides multiple normalization features. For example, if you want to get
woltka classify ... --frac
```

See [here](normalize.md) for details.
See [here](normalize.md#relative-abundance) for details.

### Can Woltka report cell values in the units of CPM, RPK, and TPM?

Yes. See [here](normalize.md) for methods.
Yes. See [here](normalize.md) for methods. For example:

```bash
woltka classify ... -c coords.txt --sizes . --scale 1k --digits 3 -o rpk.biom
woltka normalize -i rpk.biom --scale 1M -o tpm.biom
```

### I already got a gene (ORF) profile of raw counts, can I still convert it into RPK?

Yes. You can do:

```bash
woltka normalize -i orf.tsv -z coords.txt -s 1k -d 3 -o orf.rpk.tsv
```

See [here](normalize.md#abundance-of-functional-genes) for details.

### I ran Woltka separately on multiple subsets of data. Can I merge the results?

Yes. The `woltka merge` command is for you. See [here](merge.md) for details.
Yes. The `merge` command is for you. See [here](merge.md) for details.

### Can Woltka report taxon names instead of TaxIDs when using NCBI taxonomy?

Expand All @@ -123,6 +138,6 @@ See [here](classify.md#unassigned-sequences) for details.

### I attempted to collapse a stratified feature table, and the output is empty?

In this case, you will need to specify which field of a stratified feature should be collapsed, using parameter `--field` followed by field index (starting from 1), otherwise the program will try to collapse the entire feature.
In this case, you will need to specify which field of a stratified feature should be collapsed, using parameter `--field` or `-f` followed by field index (starting from 1), otherwise the program will try to collapse the entire feature.

See [here](collapse.md#stratification) for details.
See [here](collapse.md#stratified-features) for details.
2 changes: 1 addition & 1 deletion doc/gtdb.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

**GTDB** (https://gtdb.ecogenomic.org/) ([Parks et al., 2018](https://www.nature.com/articles/nbt.4229)) is a standardized taxonomy system for bacteria and archaea. It is based on the phylogenetic trees of genomes, hence reflecting the evolutionary relationships among microorganisms better, compared to conventional taxonomy. Our research shows that GTDB shares high consistency with the [WoL](https://biocore.github.io/wol/) phylogeny (which was built using more marker genes and more expensive methods) ([Zhu et al., 2019](https://www.nature.com/articles/s41467-019-13443-4)).

This tutorial will be based on GTDB [**R04-RS89**](https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/) (version 4, indexed to RefSeq release 89), which was the current release at the time of writing. This release contains 143,512 bacterial and 2,392 archaeal organisms in its taxonomy, or 23,458 bacterial and 1,248 archaeal genomes in its collection of actual genome sequences. These 24,706 "**species clusters**" cover the complete GTDB taxonomic framework ([Parks et al., 2020](https://www.nature.com/articles/s41587-020-0501-8)).
This tutorial is based on GTDB [**R04-RS89**](https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/) (version 4, indexed to RefSeq release 89), which was the current release at the time of writing. This release contains 143,512 bacterial and 2,392 archaeal organisms in its taxonomy, or 23,458 bacterial and 1,248 archaeal genomes in its collection of actual genome sequences. These 24,706 "**species clusters**" cover the complete GTDB taxonomic framework ([Parks et al., 2020](https://www.nature.com/articles/s41587-020-0501-8)).

This tutorial will explain various strategies of utilizing GTDB taxonomy / phylogeny in a meta'omics analysis with Woltka.

Expand Down
3 changes: 2 additions & 1 deletion doc/input.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ align/

2\. A **mapping file** of sample ID \<tab\> alignment file path. The paths must point to existing files. They can either be full paths, or simply filenames under the same directory as the mapping file. For example, one can place a `map.txt` of the following content to where alignment files are located.


```
S01 <tab> S01.sam.gz
S02 <tab> S02.sam.gz
Expand All @@ -39,7 +40,7 @@ S03 <tab> S03.sam.gz

- If you don't want demultiplexing (i.e., it is a single-sample file), add `--no-demux` to the command.

4\. "-", representing standard input (**stdin**). Using this method, you can ["pipe"](https://en.wikipedia.org/wiki/Pipeline_(Unix)) alignments generated by another program directly into Woltka. Demultiplexing is on by default (see above). For example, you can do:
4\. "`-`", representing [standard input](https://en.wikipedia.org/wiki/Standard_streams#Standard_input_(stdin)) (**stdin**). Using this method, you can "[pipe](https://en.wikipedia.org/wiki/Pipeline_(Unix))" alignments generated by another program directly into Woltka. Demultiplexing is on by default (see above). For example, you can do:

```bash
bowtie2 -x db -U input.fq | woltka classify -i - -o output.biom
Expand Down
2 changes: 1 addition & 1 deletion doc/install.md
Original file line number Diff line number Diff line change
Expand Up @@ -104,7 +104,7 @@ If in the future some dependencies have changes that are not compatible with the
conda create -n woltka python=3.10.8
conda activate woltka
conda install -c conda-forge biom-format=2.1.13
conda install -c bioconda woltka=0.1.4
conda install -c bioconda woltka=0.1.5
```

## Test
Expand Down
25 changes: 17 additions & 8 deletions doc/normalize.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,8 @@ By default, the cell values in a feature table (profile) are **counts** (**frequ
- [During or after classification](#during-or-after-classification)
- [Normalization by subject size](#normalization-by-subject-size)
- [Sequence and taxonomic abundances](#sequence-and-taxonomic-abundances)
- [Why real-time normalization matters](#Why-real-time-normalization-matters)
- [Gene coordinates as size map](#gene-coordinates-as-size-map)
- [Why real-time normalization matters](#why-real-time-normalization-matters)
- [Abundance of functional genes](#abundance-of-functional-genes)
- [Relative abundance](#relative-abundance)

Expand Down Expand Up @@ -54,12 +55,6 @@ Or post classification (on existing profiles):
woltka normalize --sizes size.map ...
```

A special case is during "coord-match" functional classification (see [details](ordina.md)), one can use a dot (`.`) instead of a mapping file. Woltka will read gene sizes from the gene coordinates file.

```bash
woltka classify -c coords.txt --sizes . ...
```

### Sequence and taxonomic abundances

A common usage of this function is to convert **sequence abundance** to **taxonomic abundance**.
Expand Down Expand Up @@ -125,6 +120,20 @@ woltka classify \

The output values are in the unit of **reads per kilobase, or RPK**. They reflect the quantity of functional genes found in the sample.

Alternatively, if one already obtained a gene (ORF) profile without normalization:

```bash
woltka classify -i indir -c coords.txt -o orf.tsv
```

On can still normalize the profile by feeding the gene coordinates file to the `normalize` command:

```bash
woltka normalize -i orf.tsv --sizes coords.txt --scale 1k --digits 3 -o orf.rpk.tsv
```

There are two notes though. First, this can only be done with the genes (ORFs) but not higher functional units ([explained](#why-real-time-normalization-matters) above). Second, there may be slight difference between some cell values generated using the two methods. This is caused by rounding imprecision when the same read is mapped to multiple genes (ORFs). To avoid this (if you are paranoid about it), you may add `--digits 3` to the `classify` command to make the numbers more precise. Nevertheless, this issue likely won't affect the analysis outcome.


### Relative abundance

Expand All @@ -146,7 +155,7 @@ woltka normalize ... (without --sizes, automatically applies --frac)

**Note**: These values are the fractions of reads assigned to each classification units versus all reads that are **assigned**. If you add `--unassigned`, the values become the fractions of reads versus all reads that are **aligned**. Woltka cannot calculate the fractions of reads versus the **original** sequencing data, since it processes alignment files instead of raw FastQ files. However, it isn't hard to do this calculation manually if you know the sequencing depth information.

This function can also convert an RPK functional profile (see above) to the unit of **TPM (transcripts per kilobase million)**:
This command can also convert an RPK functional profile (see above) to the unit of **TPM (transcripts per kilobase million)**:

```bash
woltka normalize -i rpk.biom --scale 1M -o tpm.biom
Expand Down
18 changes: 17 additions & 1 deletion doc/ogu.md
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ If needed, you may convert a BIOM table into a tab-delimited file:
biom convert --to-tsv -i table.biom -o table.tsv
```

Note: [**Qiita**](https://qiita.ucsd.edu/) implements the WoL database and a Woltka workflow, which performs sequence alignment using the [SHOGUN protocol](align.md#the-shogun-protocol). If you are a Qiita user, the alignment file can be automatically generated and downloaded from the Qiita interface. See [details](qiita.md).
Note: [**Qiita**](https://qiita.ucsd.edu/) implements the WoL database and a Woltka workflow, which performs sequence alignment using the [SHOGUN protocol](align.md#the-shogun-protocol). If you are a Qiita user, the alignment file as well as the OGU table (called a "per-genome" table) can be automatically generated and downloaded from the Qiita interface. See [details](qiita.md).


## OGU analysis using QIIME 2
Expand Down Expand Up @@ -104,3 +104,19 @@ In multiple reference genome databases, subject sequences are individual nucleot
```bash
woltka classify -m nucl2g.txt -i input_dir -o output.biom
```

## OGUs from ORFs

In some scenarios you may have a profile of genes, or ORFs (open reading frames), denoted as e.g. "G000123456_789" (meaning the 789th ORF of genome G000123456). This can be generated using Woltka's "coord-match" functional classification (see [details](ordinal.md)), or through sequence alignment against genes (instead of genomes) (see [details](align.md)). You can extract genome IDs (i.e., OGUs) from ORF IDs using the [`collapse`](collapse.md) command:

```bash
woltka collapse -i orf.biom -e -f 1 -o ogu.biom
```

- In this command, `-e` means that feature IDs are nested (such as the current case); `-f 1` extracts the first field in the feature IDs.

Note however, that the resulting OGU table _is not identical to_ that generated from the [standard method](#ogu-table-generation). This is because gene-based methods automatically miss out [intergenic regions](https://en.wikipedia.org/wiki/Intergenic_region), which are usually insignificant in microbial genomes, but can still make a difference.

When should you use this approach? For certain data types, such as [metatranscriptomic data](https://en.wikipedia.org/wiki/Metatranscriptomics), which are naturally mRNA-derived, the gene-based approach may be more justified (although you still miss out unannotated genes and [non-coding transcripts](https://en.wikipedia.org/wiki/Non-coding_RNA)). You will need to decide based on the study goals.

Another more common scenario is when you want to perform a structural/functional stratification analysis. You start with ORFs, and care what these genes do and which source organisms they are from. See [details](stratify.md) about stratification.
Loading

0 comments on commit 225b0a1

Please sign in to comment.