Merge pull request #179 from qiyunzhu/0.1.5

For version 0.1.5
qiyunzhu · Dec 26, 2022 · 225b0a1 · 225b0a1
2 parents f7adceb + 87af16c
commit 225b0a1
Show file tree

Hide file tree

Showing 16 changed files with 304 additions and 65 deletions.
diff --git a/doc/cli.md b/doc/cli.md
@@ -2,8 +2,16 @@
 
 The command-line interface (CLI) of Woltka provides several commands:
 
-- [**classify**](#classify): Complete classification workflow with all parameters.
-- [**tools**](#tools): Utilities for working with alignments, maps and profiles.
+Main workflow:
+- [**classify**](#classify): Complete classification workflow that analyzes sequence alignments based on a classification system and generate profiles.
+
+Profile utilities:
+- [**collapse**](#collapse): Collapse a profile by feature mapping and/or hierarchy.
+- [**normalize**](#normalize): Normalize a profile to fractions and/or by feature sizes.
+- [**filter**](#filter): Filter a profile by per-sample abundance.
+- [**merge**](#merge): Merge multiple profiles into one profile.
+- [**coverage**](#coverage): Calculate per-sample coverage of feature groups.
+
 
 ## Classify
 
@@ -93,35 +101,27 @@ Option | Description
 `--no-exe` | Disable calling external programs (`gzip`, `bzip2` and `xz`) for decompression. Otherwise, Woltka will use them if available for faster processing, or switch back to Python if not.
 
 
-## Tools
-
-### Filter
-
-Filter a profile by **per-sample** abundance.
-
-* See [Per-sample filtering](filter.md) for details.
-
-Option | Description
---- | ---
-`--input`, `-i` (required) | Path to input alignment file or directory of alignment files.
-`--output`, `-o` (required) | Path to output profile file or directory of profile files.
-`--min-count`, `-c` | Per-sample minimum count threshold (>=1).
-`--min-percent`, `-p` | Per-sample minimum percentage threshold (<100).
+## Collapse
 
-### Merge
+Collapse a profile based on feature mapping (supports **many-to-many** mapping) and/or hierarchy.
 
-Merge multiple profiles into one profile.
-
-* See [Merging profiles](merge.md) for details.
+* See [profile collapsing](collapse.md) for details.
 
 Option | Description
 --- | ---
-`--input`, `-i` (required) | Path to input profiles or directories containing profiles. Can accept multiple paths.
+`--input`, `-i` (required) | Path to input profile.
 `--output`, `-o` (required) | Path to output profile.
+`--map`, `-m` | Path to mapping of source features to target features.
+`--divide`, `-d` | Count each target feature as 1 / _k_ (_k_ is the number of targets mapped to a source). Otherwise, count as one.
+`--field`, `-f` | Features are stratified (strata delimited by "\|"). For example, if features are like "species\|gene", one can use `-f 1` to collapse "species" or `-f 2` to collapse "gene".
+`--nested`, `-e` | Features are nested (each field is a child of the previous field). For example, "A_1" represents "1" of "A", and the entire feature is equivalent to stratified feature "A\|A_1". This parameter overrides the "\|"-delimited strata.
+`--sep`, `-s` | Field separator for stratified features (default: "\|") or nested features (default: "_").
+`--names`, `-n` | Path to mapping of target features to names. The names will be appended to the collapsed profile as a metadata column.
 
-### Normalize
 
-Normalize a profile to fractions or by feature sizes.
+## Normalize
+
+Normalize a profile to fractions and/or by feature sizes
 
 * See [Profile normalization](normalize.md) for details.
 
@@ -133,24 +133,34 @@ Option | Description
 `--scale`, `-s` | Scale values by this factor. Accepts "k", "M" suffixes.
 `--digits`, `-d` | Round values to this number of digits after the decimal point. If omitted, will keep decimal precision of input profile.
 
-### Collapse
 
-Collapse a profile based on feature mapping (supports **many-to-many** mapping) (details).
+## Filter
 
-* See [profile collapsing](collapse.md) for details.
+Filter a profile by **per-sample** abundance.
+
+* See [Per-sample filtering](filter.md) for details.
 
 Option | Description
 --- | ---
-`--input`, `-i` (required) | Path to input profile.
-`--map`, `-m` (required) | Path to mapping of source features to target features.
+`--input`, `-i` (required) | Path to input alignment file or directory of alignment files.
+`--output`, `-o` (required) | Path to output profile file or directory of profile files.
+`--min-count`, `-c` | Per-sample minimum count threshold (>=1).
+`--min-percent`, `-p` | Per-sample minimum percentage threshold (<100).
+
+
+## Merge
+
+Merge multiple profiles into one profile.
+
+* See [Merging profiles](merge.md) for details.
+
+Option | Description
+--- | ---
+`--input`, `-i` (required) | Path to input profiles or directories containing profiles. Can accept multiple paths.
 `--output`, `-o` (required) | Path to output profile.
-`--divide`, `-d` | Count each target feature as 1 / _k_ (_k_ is the number of targets mapped to a source). Otherwise, count as one.
-`--field`, `-f` | Features are stratified (strata delimited by "\|"). For example, if features are like "species\|gene", one can use `-f 1` to collapse "species" or `-f 2` to collapse "gene".
-`--nested`, `-e` | Features are nested (each field is a child of the previous field). For example, "A_1" represents "1" of "A", and the entire feature is equivalent to stratified feature "A\|A_1". This parameter overrides the "\|"-delimited strata.
-`--sep`, `-s` | Field separator for stratified features (default: "\|") or nested features (default: "_").
-`--names`, `-n` | Path to mapping of target features to names. The names will be appended to the collapsed profile as a metadata column.
 
-### Coverage
+
+## Coverage
 
 Calculate per-sample coverage of feature groups in a profile.
 
@@ -164,3 +174,8 @@ Option | Description
 `--threshold`, `-t` | Convert coverage to presence (1) / absence (0) data by this percentage threshold.
 `--count`, `-c` | Record numbers of covered features instead of percentages (overrides threshold).
 `--names`, `-n` | Path to mapping of feature groups to names. The names will be appended to the coverage table as a metadata column.
+
+
+## Tools
+
+A sub menu containing all commands except for `classify`. It is for backward compatibility. It is deprecated and will be removed in the next release.
diff --git a/doc/collapse.md b/doc/collapse.md
@@ -5,13 +5,13 @@ The **profile collapsing** function is a lightweight and flexible addition to th
 Collapse features based on a mapping file:
 
 ```bash
-woltka collapse -i input.biom -m mapping.txt -o output.biom
+woltka collapse -i gene.biom -m ko.map -o ko.biom
 ```
 
-Collapse nested features to the first level:
+Collapse nested features to the second level:
 
 ```bash
-woltka collapse -i input.biom -e -f 1 -o output.biom
+woltka collapse -i lineage.tsv -e -f 2 -o phylum.tsv
 ```
 
 

diff --git a/doc/coverage.md b/doc/coverage.md
@@ -38,7 +38,7 @@ L-glutamine degradation I | 100.0 | 100.0 | 50.0 | 0.0
 Sucrose biosynthesis I | 20.0 | 20.0 | 20.0 | 20.0
 ... |
 
-**Note**: The "coverage" computed by Woltka is distinct from those by HUMAnN2 (whether the pathway is present) and HUMAnN3 (how likely the pathway is present).
+**Note**: The "coverage" computed by Woltka is not the same as those by [HUMAnN2](https://huttenhower.sph.harvard.edu/humann2) (whether the pathway is present) and [HUMAnN3](https://huttenhower.sph.harvard.edu/humann) (how likely the pathway is present), although the usage and interpretation may be comparable.
 
 ## Parameters
 

diff --git a/doc/faq.md b/doc/faq.md
@@ -17,7 +17,7 @@ To date, all Woltka versions (0.1.0 to 0.1.5) generate **identical** output file
 
 ### How many CPU cores does Woltka use?
 
-Woltka works the best with two CPU cores: one for file decompression and the other for classification. This happens automatically. See [here](perform.md#keep-external-decompressors-on) for details.
+Woltka works the best with **two CPU cores**: one for file decompression and the other for classification. This happens automatically. See [here](perform.md#keep-external-decompressors-on) for details.
 
 ### Does Woltka support multiprocessing?
 
@@ -28,11 +28,11 @@ Not at the moment. But you can run multiple Woltka instances on different subset
 
 ### Does Woltka support gzipped alignment files?
 
-Yes. All input files for Woltka (alignments and databases) can be supplied as compressed in gzip, bzip2 or xz formats. Woltka will automatically recognize and process them.
+Yes. All input files for Woltka (alignments and databases) can be supplied as compressed in **gzip**, **bzip2** or **xz** formats. Woltka will automatically recognize and process them.
 
 ### Does Woltka support [BAM](https://en.wikipedia.org/wiki/Binary_Alignment_Map) and [CRAM](https://en.wikipedia.org/wiki/CRAM_(file_format)) formats?
 
-Not out-of-the-box. But you can use SAMtools to extract BAM/CRAM files and directly "pipe" into Woltka, like this ("-" represents stdin):
+Not out-of-the-box. But you can use SAMtools to extract BAM/CRAM files and directly "pipe" them into Woltka, like this ("-" represents stdin):
 
 ```bash
 samtools view input.bam | woltka classify -i - -o output.biom
@@ -89,15 +89,30 @@ Woltka provides multiple normalization features. For example, if you want to get
 woltka classify ... --frac
 ```
 
-See [here](normalize.md) for details.
+See [here](normalize.md#relative-abundance) for details.
 
 ### Can Woltka report cell values in the units of CPM, RPK, and TPM?
 
-Yes. See [here](normalize.md) for methods.
+Yes. See [here](normalize.md) for methods. For example:
+
+```bash
+woltka classify ... -c coords.txt --sizes . --scale 1k --digits 3 -o rpk.biom
+woltka normalize -i rpk.biom --scale 1M -o tpm.biom
+```
+
+### I already got a gene (ORF) profile of raw counts, can I still convert it into RPK?
+
+Yes. You can do:
+
+```bash
+woltka normalize -i orf.tsv -z coords.txt -s 1k -d 3 -o orf.rpk.tsv
+```
+
+See [here](normalize.md#abundance-of-functional-genes) for details.
 
 ### I ran Woltka separately on multiple subsets of data. Can I merge the results?
 
-Yes. The `woltka merge` command is for you. See [here](merge.md) for details.
+Yes. The `merge` command is for you. See [here](merge.md) for details.
 
 ### Can Woltka report taxon names instead of TaxIDs when using NCBI taxonomy?
 
@@ -123,6 +138,6 @@ See [here](classify.md#unassigned-sequences) for details.
 
 ### I attempted to collapse a stratified feature table, and the output is empty?
 
-In this case, you will need to specify which field of a stratified feature should be collapsed, using parameter `--field` followed by field index (starting from 1), otherwise the program will try to collapse the entire feature.
+In this case, you will need to specify which field of a stratified feature should be collapsed, using parameter `--field` or `-f` followed by field index (starting from 1), otherwise the program will try to collapse the entire feature.
 
-See [here](collapse.md#stratification) for details.
+See [here](collapse.md#stratified-features) for details.
diff --git a/doc/gtdb.md b/doc/gtdb.md
@@ -2,7 +2,7 @@
 
 **GTDB** (https://gtdb.ecogenomic.org/) ([Parks et al., 2018](https://www.nature.com/articles/nbt.4229)) is a standardized taxonomy system for bacteria and archaea. It is based on the phylogenetic trees of genomes, hence reflecting the evolutionary relationships among microorganisms better, compared to conventional taxonomy. Our research shows that GTDB shares high consistency with the [WoL](https://biocore.github.io/wol/) phylogeny (which was built using more marker genes and more expensive methods) ([Zhu et al., 2019](https://www.nature.com/articles/s41467-019-13443-4)).
 
-This tutorial will be based on GTDB [**R04-RS89**](https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/) (version 4, indexed to RefSeq release 89), which was the current release at the time of writing. This release contains 143,512 bacterial and 2,392 archaeal organisms in its taxonomy, or 23,458 bacterial and 1,248 archaeal genomes in its collection of actual genome sequences. These 24,706 "**species clusters**" cover the complete GTDB taxonomic framework ([Parks et al., 2020](https://www.nature.com/articles/s41587-020-0501-8)).
+This tutorial is based on GTDB [**R04-RS89**](https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/) (version 4, indexed to RefSeq release 89), which was the current release at the time of writing. This release contains 143,512 bacterial and 2,392 archaeal organisms in its taxonomy, or 23,458 bacterial and 1,248 archaeal genomes in its collection of actual genome sequences. These 24,706 "**species clusters**" cover the complete GTDB taxonomic framework ([Parks et al., 2020](https://www.nature.com/articles/s41587-020-0501-8)).
 
 This tutorial will explain various strategies of utilizing GTDB taxonomy / phylogeny in a meta'omics analysis with Woltka.
 

diff --git a/doc/input.md b/doc/input.md
@@ -28,6 +28,7 @@ align/
 
 2\. A **mapping file** of sample ID \<tab\> alignment file path. The paths must point to existing files. They can either be full paths, or simply filenames under the same directory as the mapping file. For example, one can place a `map.txt` of the following content to where alignment files are located.
 
+
 ```
 S01 <tab> S01.sam.gz
 S02 <tab> S02.sam.gz
@@ -39,7 +40,7 @@ S03 <tab> S03.sam.gz
 
 - If you don't want demultiplexing (i.e., it is a single-sample file), add `--no-demux` to the command.
 
-4\. "-", representing standard input (**stdin**). Using this method, you can ["pipe"](https://en.wikipedia.org/wiki/Pipeline_(Unix)) alignments generated by another program directly into Woltka. Demultiplexing is on by default (see above). For example, you can do:
+4\. "`-`", representing [standard input](https://en.wikipedia.org/wiki/Standard_streams#Standard_input_(stdin)) (**stdin**). Using this method, you can "[pipe](https://en.wikipedia.org/wiki/Pipeline_(Unix))" alignments generated by another program directly into Woltka. Demultiplexing is on by default (see above). For example, you can do:
 
 ```bash
 bowtie2 -x db -U input.fq | woltka classify -i - -o output.biom

diff --git a/doc/install.md b/doc/install.md
@@ -104,7 +104,7 @@ If in the future some dependencies have changes that are not compatible with the
 conda create -n woltka python=3.10.8
 conda activate woltka
 conda install -c conda-forge biom-format=2.1.13
-conda install -c bioconda woltka=0.1.4
+conda install -c bioconda woltka=0.1.5
 ```
 
 ## Test

diff --git a/doc/normalize.md b/doc/normalize.md
@@ -13,7 +13,8 @@ By default, the cell values in a feature table (profile) are **counts** (**frequ
 - [During or after classification](#during-or-after-classification)
 - [Normalization by subject size](#normalization-by-subject-size)
   - [Sequence and taxonomic abundances](#sequence-and-taxonomic-abundances)
-  - [Why real-time normalization matters](#Why-real-time-normalization-matters)
+  - [Gene coordinates as size map](#gene-coordinates-as-size-map)
+  - [Why real-time normalization matters](#why-real-time-normalization-matters)
   - [Abundance of functional genes](#abundance-of-functional-genes)
 - [Relative abundance](#relative-abundance)
 
@@ -54,12 +55,6 @@ Or post classification (on existing profiles):
 woltka normalize --sizes size.map ...
 ```
 
-A special case is during "coord-match" functional classification (see [details](ordina.md)), one can use a dot (`.`) instead of a mapping file. Woltka will read gene sizes from the gene coordinates file.
-
-```bash
-woltka classify -c coords.txt --sizes . ...
-```
-
 ### Sequence and taxonomic abundances
 
 A common usage of this function is to convert **sequence abundance** to **taxonomic abundance**.
@@ -125,6 +120,20 @@ woltka classify \
 
 The output values are in the unit of **reads per kilobase, or RPK**. They reflect the quantity of functional genes found in the sample.
 
+Alternatively, if one already obtained a gene (ORF) profile without normalization:
+
+```bash
+woltka classify -i indir -c coords.txt -o orf.tsv
+```
+
+On can still normalize the profile by feeding the gene coordinates file to the `normalize` command:
+
+```bash
+woltka normalize -i orf.tsv --sizes coords.txt --scale 1k --digits 3 -o orf.rpk.tsv
+```
+
+There are two notes though. First, this can only be done with the genes (ORFs) but not higher functional units ([explained](#why-real-time-normalization-matters) above). Second, there may be slight difference between some cell values generated using the two methods. This is caused by rounding imprecision when the same read is mapped to multiple genes (ORFs). To avoid this (if you are paranoid about it), you may add `--digits 3` to the `classify` command to make the numbers more precise. Nevertheless, this issue likely won't affect the analysis outcome.
+
 
 ### Relative abundance
 
@@ -146,7 +155,7 @@ woltka normalize ... (without --sizes, automatically applies --frac)
 
 **Note**: These values are the fractions of reads assigned to each classification units versus all reads that are **assigned**. If you add `--unassigned`, the values become the fractions of reads versus all reads that are **aligned**. Woltka cannot calculate the fractions of reads versus the **original** sequencing data, since it processes alignment files instead of raw FastQ files. However, it isn't hard to do this calculation manually if you know the sequencing depth information.
 
-This function can also convert an RPK functional profile (see above) to the unit of **TPM (transcripts per kilobase million)**:
+This command can also convert an RPK functional profile (see above) to the unit of **TPM (transcripts per kilobase million)**:
 
 ```bash
 woltka normalize -i rpk.biom --scale 1M -o tpm.biom

diff --git a/doc/ogu.md b/doc/ogu.md
@@ -47,7 +47,7 @@ If needed, you may convert a BIOM table into a tab-delimited file:
 biom convert --to-tsv -i table.biom -o table.tsv
 ```
 
-Note: [**Qiita**](https://qiita.ucsd.edu/) implements the WoL database and a Woltka workflow, which performs sequence alignment using the [SHOGUN protocol](align.md#the-shogun-protocol). If you are a Qiita user, the alignment file can be automatically generated and downloaded from the Qiita interface. See [details](qiita.md).
+Note: [**Qiita**](https://qiita.ucsd.edu/) implements the WoL database and a Woltka workflow, which performs sequence alignment using the [SHOGUN protocol](align.md#the-shogun-protocol). If you are a Qiita user, the alignment file as well as the OGU table (called a "per-genome" table) can be automatically generated and downloaded from the Qiita interface. See [details](qiita.md).
 
 
 ## OGU analysis using QIIME 2
@@ -104,3 +104,19 @@ In multiple reference genome databases, subject sequences are individual nucleot
 ```bash
 woltka classify -m nucl2g.txt -i input_dir -o output.biom
 ```
+
+## OGUs from ORFs
+
+In some scenarios you may have a profile of genes, or ORFs (open reading frames), denoted as e.g. "G000123456_789" (meaning the 789th ORF of genome G000123456). This can be generated using Woltka's "coord-match" functional classification (see [details](ordinal.md)), or through sequence alignment against genes (instead of genomes) (see [details](align.md)). You can extract genome IDs (i.e., OGUs) from ORF IDs using the [`collapse`](collapse.md) command:
+
+```bash
+woltka collapse -i orf.biom -e -f 1 -o ogu.biom
+```
+
+- In this command, `-e` means that feature IDs are nested (such as the current case); `-f 1` extracts the first field in the feature IDs.
+
+Note however, that the resulting OGU table _is not identical to_ that generated from the [standard method](#ogu-table-generation). This is because gene-based methods automatically miss out [intergenic regions](https://en.wikipedia.org/wiki/Intergenic_region), which are usually insignificant in microbial genomes, but can still make a difference.
+
+When should you use this approach? For certain data types, such as [metatranscriptomic data](https://en.wikipedia.org/wiki/Metatranscriptomics), which are naturally mRNA-derived, the gene-based approach may be more justified (although you still miss out unannotated genes and [non-coding transcripts](https://en.wikipedia.org/wiki/Non-coding_RNA)). You will need to decide based on the study goals.
+
+Another more common scenario is when you want to perform a structural/functional stratification analysis. You start with ORFs, and care what these genes do and which source organisms they are from. See [details](stratify.md) about stratification.