From 672f5e79de664bd19539a255d6427aefa509c9e1 Mon Sep 17 00:00:00 2001
From: qiyunzhu
Date: Sun, 25 Dec 2022 15:09:52 -0700
Subject: [PATCH 1/3] improved documentation

---
 doc/cli.md | 83 ++++++++++++++++++++++++++++--------------------
 doc/collapse.md | 6 ++--
 doc/coverage.md | 2 +-
 doc/faq.md | 12 +++----
 doc/gtdb.md | 2 +-
 doc/input.md | 3 +-
 doc/install.md | 2 +-
 doc/normalize.md | 4 +--
 doc/ogu.md | 18 ++++++++++-
 doc/ordinal.md | 10 ++++++
 doc/output.md | 6 ++--
 doc/stratify.md | 37 ++++++++++++++++++++-
 12 files changed, 131 insertions(+), 54 deletions(-)

diff --git a/doc/cli.md b/doc/cli.md
index 99c7846..446fcac 100644
--- a/doc/cli.md
+++ b/doc/cli.md
@@ -2,8 +2,16 @@

The command-line interface (CLI) of Woltka provides several commands:

-- [**classify**](#classify): Complete classification workflow with all parameters.
-- [**tools**](#tools): Utilities for working with alignments, maps and profiles.
+Main workflow:
+- [**classify**](#classify): Complete classification workflow that analyzes sequence alignments based on a classification system and generates profiles.
+
+Profile utilities:
+- [**collapse**](#collapse): Collapse a profile by feature mapping and/or hierarchy.
+- [**normalize**](#normalize): Normalize a profile to fractions and/or by feature sizes.
+- [**filter**](#filter): Filter a profile by per-sample abundance.
+- [**merge**](#merge): Merge multiple profiles into one profile.
+- [**coverage**](#coverage): Calculate per-sample coverage of feature groups.
+

## Classify

@@ -93,35 +101,27 @@ Option | Description
`--no-exe` | Disable calling external programs (`gzip`, `bzip2` and `xz`) for decompression. Otherwise, Woltka will use them if available for faster processing, or switch back to Python if not.

-## Tools
-
-### Filter
-
-Filter a profile by **per-sample** abundance.
-
-* See [Per-sample filtering](filter.md) for details.
-
-Option | Description
---- | ---
-`--input`, `-i` (required) | Path to input alignment file or directory of alignment files.
-`--output`, `-o` (required) | Path to output profile file or directory of profile files.
-`--min-count`, `-c` | Per-sample minimum count threshold (>=1).
-`--min-percent`, `-p` | Per-sample minimum percentage threshold (<100).
+## Collapse

-### Merge
+Collapse a profile based on feature mapping (supports **many-to-many** mapping) and/or hierarchy.

-Merge multiple profiles into one profile.
-
-* See [Merging profiles](merge.md) for details.
+* See [profile collapsing](collapse.md) for details.

Option | Description
--- | ---
-`--input`, `-i` (required) | Path to input profiles or directories containing profiles. Can accept multiple paths.
+`--input`, `-i` (required) | Path to input profile.
`--output`, `-o` (required) | Path to output profile.
+`--map`, `-m` | Path to mapping of source features to target features.
+`--divide`, `-d` | Count each target feature as 1 / _k_ (_k_ is the number of targets mapped to a source). Otherwise, count as one.
+`--field`, `-f` | Features are stratified (strata delimited by "\|"). For example, if features are like "species\|gene", one can use `-f 1` to collapse "species" or `-f 2` to collapse "gene".
+`--nested`, `-e` | Features are nested (each field is a child of the previous field). For example, "A_1" represents "1" of "A", and the entire feature is equivalent to stratified feature "A\|A_1". This parameter overrides the "\|"-delimited strata.
+`--sep`, `-s` | Field separator for stratified features (default: "\|") or nested features (default: "_").
+`--names`, `-n` | Path to mapping of target features to names. The names will be appended to the collapsed profile as a metadata column.

-### Normalize
-Normalize a profile to fractions or by feature sizes.
+## Normalize
+
+Normalize a profile to fractions and/or by feature sizes.

* See [Profile normalization](normalize.md) for details.

@@ -133,24 +133,34 @@ Option | Description
`--scale`, `-s` | Scale values by this factor. Accepts "k", "M" suffixes.
`--digits`, `-d` | Round values to this number of digits after the decimal point. If omitted, will keep decimal precision of input profile.

-### Collapse
-Collapse a profile based on feature mapping (supports **many-to-many** mapping) (details).
+## Filter

-* See [profile collapsing](collapse.md) for details.
+Filter a profile by **per-sample** abundance.
+
+* See [Per-sample filtering](filter.md) for details.

Option | Description
--- | ---
-`--input`, `-i` (required) | Path to input profile.
-`--map`, `-m` (required) | Path to mapping of source features to target features.
+`--input`, `-i` (required) | Path to input profile file or directory of profile files.
+`--output`, `-o` (required) | Path to output profile file or directory of profile files.
+`--min-count`, `-c` | Per-sample minimum count threshold (>=1).
+`--min-percent`, `-p` | Per-sample minimum percentage threshold (<100).
+
+
+## Merge
+
+Merge multiple profiles into one profile.
+
+* See [Merging profiles](merge.md) for details.
+
+Option | Description
+--- | ---
+`--input`, `-i` (required) | Path to input profiles or directories containing profiles. Can accept multiple paths.
`--output`, `-o` (required) | Path to output profile.
-`--divide`, `-d` | Count each target feature as 1 / _k_ (_k_ is the number of targets mapped to a source). Otherwise, count as one.
-`--field`, `-f` | Features are stratified (strata delimited by "\|"). For example, if features are like "species\|gene", one can use `-f 1` to collapse "species" or `-f 2` to collapse "gene".
-`--nested`, `-e` | Features are nested (each field is a child of the previous field). For example, "A_1" represents "1" of "A", and the entire feature is equivalent to stratified feature "A\|A_1". This parameter overrides the "\|"-delimited strata.
-`--sep`, `-s` | Field separator for stratified features (default: "\|") or nested features (default: "_").
-`--names`, `-n` | Path to mapping of target features to names. The names will be appended to the collapsed profile as a metadata column.
-### Coverage
+
+
+## Coverage

Calculate per-sample coverage of feature groups in a profile.

@@ -164,3 +174,8 @@ Option | Description
`--threshold`, `-t` | Convert coverage to presence (1) / absence (0) data by this percentage threshold.
`--count`, `-c` | Record numbers of covered features instead of percentages (overrides threshold).
`--names`, `-n` | Path to mapping of feature groups to names. The names will be appended to the coverage table as a metadata column.
+
+
+## Tools
+
+A submenu containing all commands except for `classify`, kept for backward compatibility. It is deprecated and will be removed in the next release.
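The profile utilities above are designed to chain together, since each reads and writes a profile. As a quick orientation, below is a minimal sketch of a typical post-classification session; the profile and mapping file names are hypothetical placeholders, and each command is detailed in its linked page:

```bash
# collapse a gene profile into KEGG Orthology (KO) terms
woltka collapse -i gene.biom -m gene-to-ko.map -o ko.biom

# convert counts to fractions (without --sizes, --frac is applied)
woltka normalize -i ko.biom -o ko_frac.biom

# drop features below a per-sample minimum of 0.01%
woltka filter -i ko_frac.biom -p 0.01 -o ko_filt.biom

# combine profiles generated from two batches of samples
woltka merge -i batch1/ko.biom -i batch2/ko.biom -o ko_all.biom
```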
diff --git a/doc/collapse.md b/doc/collapse.md
index f00716d..01381e1 100644
--- a/doc/collapse.md
+++ b/doc/collapse.md
@@ -5,13 +5,13 @@ The **profile collapsing** function is a lightweight and flexible addition to th

Collapse features based on a mapping file:

```bash
-woltka collapse -i input.biom -m mapping.txt -o output.biom
+woltka collapse -i gene.biom -m ko.map -o ko.biom
```

-Collapse nested features to the first level:
+Collapse nested features to the second level:

```bash
-woltka collapse -i input.biom -e -f 1 -o output.biom
+woltka collapse -i lineage.tsv -e -f 2 -o phylum.tsv
```

diff --git a/doc/coverage.md b/doc/coverage.md
index 030da99..f464165 100644
--- a/doc/coverage.md
+++ b/doc/coverage.md
@@ -38,7 +38,7 @@ L-glutamine degradation I | 100.0 | 100.0 | 50.0 | 0.0
Sucrose biosynthesis I | 20.0 | 20.0 | 20.0 | 20.0
... |

-**Note**: The "coverage" computed by Woltka is distinct from those by HUMAnN2 (whether the pathway is present) and HUMAnN3 (how likely the pathway is present).
+**Note**: The "coverage" computed by Woltka is not the same as that computed by [HUMAnN2](https://huttenhower.sph.harvard.edu/humann2) (whether the pathway is present) or [HUMAnN3](https://huttenhower.sph.harvard.edu/humann) (how likely the pathway is present), although the usage and interpretation may be comparable.

## Parameters

diff --git a/doc/faq.md b/doc/faq.md
index 496fa0f..c9cdba2 100644
--- a/doc/faq.md
+++ b/doc/faq.md
@@ -17,7 +17,7 @@ To date, all Woltka versions (0.1.0 to 0.1.5) generate **identical** output file

### How many CPU cores does Woltka use?

-Woltka works the best with two CPU cores: one for file decompression and the other for classification. This happens automatically. See [here](perform.md#keep-external-decompressors-on) for details.
+Woltka works best with **two CPU cores**: one for file decompression and the other for classification. This happens automatically. See [here](perform.md#keep-external-decompressors-on) for details.

### Does Woltka support multiprocessing?

@@ -28,11 +28,11 @@ Not at the moment. But you can run multiple Woltka instances on different subset

### Does Woltka support gzipped alignment files?

-Yes. All input files for Woltka (alignments and databases) can be supplied as compressed in gzip, bzip2 or xz formats. Woltka will automatically recognize and process them.
+Yes. All input files for Woltka (alignments and databases) can be supplied compressed in **gzip**, **bzip2** or **xz** formats. Woltka will automatically recognize and process them.

### Does Woltka support [BAM](https://en.wikipedia.org/wiki/Binary_Alignment_Map) and [CRAM](https://en.wikipedia.org/wiki/CRAM_(file_format)) formats?

-Not out-of-the-box. But you can use SAMtools to extract BAM/CRAM files and directly "pipe" into Woltka, like this ("-" represents stdin):
+Not out-of-the-box. But you can use SAMtools to extract BAM/CRAM files and directly "pipe" them into Woltka, like this ("-" represents stdin):

```bash
samtools view input.bam | woltka classify -i - -o output.biom
```

@@ -97,7 +97,7 @@ Yes. See [here](normalize.md) for methods.

### I ran Woltka separately on multiple subsets of data. Can I merge the results?

-Yes. The `woltka merge` command is for you. See [here](merge.md) for details.
+Yes. The `merge` command is for you. See [here](merge.md) for details.

### Can Woltka report taxon names instead of TaxIDs when using NCBI taxonomy?

@@ -123,6 +123,6 @@ See [here](classify.md#unassigned-sequences) for details.
### I attempted to collapse a stratified feature table, and the output is empty?

-In this case, you will need to specify which field of a stratified feature should be collapsed, using parameter `--field` followed by field index (starting from 1), otherwise the program will try to collapse the entire feature.
+In this case, you will need to specify which field of a stratified feature should be collapsed, using the parameter `--field` or `-f` followed by a field index (starting from 1); otherwise the program will try to collapse the entire feature.

-See [here](collapse.md#stratification) for details.
+See [here](collapse.md#stratified-features) for details.
diff --git a/doc/gtdb.md b/doc/gtdb.md
index 65c0b4d..c62fe1b 100644
--- a/doc/gtdb.md
+++ b/doc/gtdb.md
@@ -2,7 +2,7 @@

**GTDB** (https://gtdb.ecogenomic.org/) ([Parks et al., 2018](https://www.nature.com/articles/nbt.4229)) is a standardized taxonomy system for bacteria and archaea. It is based on the phylogenetic trees of genomes, hence reflecting the evolutionary relationships among microorganisms better, compared to conventional taxonomy. Our research shows that GTDB shares high consistency with the [WoL](https://biocore.github.io/wol/) phylogeny (which was built using more marker genes and more expensive methods) ([Zhu et al., 2019](https://www.nature.com/articles/s41467-019-13443-4)).

-This tutorial will be based on GTDB [**R04-RS89**](https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/) (version 4, indexed to RefSeq release 89), which was the current release at the time of writing. This release contains 143,512 bacterial and 2,392 archaeal organisms in its taxonomy, or 23,458 bacterial and 1,248 archaeal genomes in its collection of actual genome sequences. These 24,706 "**species clusters**" cover the complete GTDB taxonomic framework ([Parks et al., 2020](https://www.nature.com/articles/s41587-020-0501-8)).
+This tutorial is based on GTDB [**R04-RS89**](https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/) (version 4, indexed to RefSeq release 89), which was the current release at the time of writing. This release contains 143,512 bacterial and 2,392 archaeal organisms in its taxonomy, or 23,458 bacterial and 1,248 archaeal genomes in its collection of actual genome sequences. These 24,706 "**species clusters**" cover the complete GTDB taxonomic framework ([Parks et al., 2020](https://www.nature.com/articles/s41587-020-0501-8)).

This tutorial will explain various strategies of utilizing GTDB taxonomy / phylogeny in a meta'omics analysis with Woltka.
diff --git a/doc/input.md b/doc/input.md
index aec4a14..7205083 100644
--- a/doc/input.md
+++ b/doc/input.md
@@ -28,6 +28,7 @@ align/

2\. A **mapping file** of sample ID \<tab\> alignment file path. The paths must point to existing files. They can either be full paths, or simply filenames under the same directory as the mapping file. For example, one can place a `map.txt` of the following content to where alignment files are located.

+
```
S01 S01.sam.gz
S02 S02.sam.gz
@@ -39,7 +40,7 @@ S03 S03.sam.gz
```

- If you don't want demultiplexing (i.e., it is a single-sample file), add `--no-demux` to the command.

-4\. "-", representing standard input (**stdin**). Using this method, you can ["pipe"](https://en.wikipedia.org/wiki/Pipeline_(Unix)) alignments generated by another program directly into Woltka. Demultiplexing is on by default (see above). For example, you can do:
+4\. 
"`-`", representing [standard input](https://en.wikipedia.org/wiki/Standard_streams#Standard_input_(stdin)) (**stdin**). Using this method, you can "[pipe](https://en.wikipedia.org/wiki/Pipeline_(Unix))" alignments generated by another program directly into Woltka. Demultiplexing is on by default (see above). For example, you can do: ```bash bowtie2 -x db -U input.fq | woltka classify -i - -o output.biom diff --git a/doc/install.md b/doc/install.md index 3beaacc..5c3d90b 100644 --- a/doc/install.md +++ b/doc/install.md @@ -104,7 +104,7 @@ If in the future some dependencies have changes that are not compatible with the conda create -n woltka python=3.10.8 conda activate woltka conda install -c conda-forge biom-format=2.1.13 -conda install -c bioconda woltka=0.1.4 +conda install -c bioconda woltka=0.1.5 ``` ## Test diff --git a/doc/normalize.md b/doc/normalize.md index 5339ee0..841518c 100644 --- a/doc/normalize.md +++ b/doc/normalize.md @@ -54,7 +54,7 @@ Or post classification (on existing profiles): woltka normalize --sizes size.map ... ``` -A special case is during "coord-match" functional classification (see [details](ordina.md)), one can use a dot (`.`) instead of a mapping file. Woltka will read gene sizes from the gene coordinates file. +A special case is during "coord-match" functional classification (see [details](ordinal.md)), one can use a dot (`.`) instead of a mapping file. Woltka will read gene sizes from the gene coordinates file. ```bash woltka classify -c coords.txt --sizes . ... @@ -146,7 +146,7 @@ woltka normalize ... (without --sizes, automatically applies --frac) **Note**: These values are the fractions of reads assigned to each classification units versus all reads that are **assigned**. If you add `--unassigned`, the values become the fractions of reads versus all reads that are **aligned**. Woltka cannot calculate the fractions of reads versus the **original** sequencing data, since it processes alignment files instead of raw FastQ files. However, it isn't hard to do this calculation manually if you know the sequencing depth information. -This function can also convert an RPK functional profile (see above) to the unit of **TPM (transcripts per kilobase million)**: +This command can also convert an RPK functional profile (see above) to the unit of **TPM (transcripts per kilobase million)**: ```bash woltka normalize -i rpk.biom --scale 1M -o tpm.biom diff --git a/doc/ogu.md b/doc/ogu.md index e01f4bd..2ff938f 100644 --- a/doc/ogu.md +++ b/doc/ogu.md @@ -47,7 +47,7 @@ If needed, you may convert a BIOM table into a tab-delimited file: biom convert --to-tsv -i table.biom -o table.tsv ``` -Note: [**Qiita**](https://qiita.ucsd.edu/) implements the WoL database and a Woltka workflow, which performs sequence alignment using the [SHOGUN protocol](align.md#the-shogun-protocol). If you are a Qiita user, the alignment file can be automatically generated and downloaded from the Qiita interface. See [details](qiita.md). +Note: [**Qiita**](https://qiita.ucsd.edu/) implements the WoL database and a Woltka workflow, which performs sequence alignment using the [SHOGUN protocol](align.md#the-shogun-protocol). If you are a Qiita user, the alignment file as well as the OGU table (called a "per-genome" table) can be automatically generated and downloaded from the Qiita interface. See [details](qiita.md). 
## OGU analysis using QIIME 2

@@ -104,3 +104,19 @@ In multiple reference genome databases, subject sequences are individual nucleot

```bash
woltka classify -m nucl2g.txt -i input_dir -o output.biom
```
+
+## OGUs from ORFs
+
+In some scenarios you may have a profile of genes, or ORFs (open reading frames), denoted as e.g. "G000123456_789" (meaning the 789th ORF of genome G000123456). This can be generated using Woltka's "coord-match" functional classification (see [details](ordinal.md)), or through sequence alignment against genes (instead of genomes) (see [details](align.md)). You can extract genome IDs (i.e., OGUs) from ORF IDs using the [`collapse`](collapse.md) command:
+
+```bash
+woltka collapse -i orf.biom -e -f 1 -o ogu.biom
+```
+
+- In this command, `-e` means that feature IDs are nested (as is the case here); `-f 1` extracts the first field of the feature IDs.
+
+Note, however, that the resulting OGU table _is not identical to_ that generated by the [standard method](#ogu-table-generation). This is because gene-based methods inherently miss [intergenic regions](https://en.wikipedia.org/wiki/Intergenic_region), which are usually insignificant in microbial genomes, but can still make a difference.
+
+When should you use this approach? For certain data types, such as [metatranscriptomic data](https://en.wikipedia.org/wiki/Metatranscriptomics), which are naturally mRNA-derived, the gene-based approach may be more justified (although you will still miss unannotated genes and [non-coding transcripts](https://en.wikipedia.org/wiki/Non-coding_RNA)). You will need to decide based on your study goals.
+
+Another, more common scenario is a structural/functional stratification analysis: you start with ORFs, and care about both what these genes do and which source organisms they come from. See [details](stratify.md) about stratification.
diff --git a/doc/ordinal.md b/doc/ordinal.md
index f1a6067..48891ea 100644
--- a/doc/ordinal.md
+++ b/doc/ordinal.md
@@ -166,4 +166,14 @@ woltka classify -i indir -c coords.txt -m gene-to-pathway.map --stratify species

In the output profile, the features will be like `species_A|pathway_X`. This helps with the investigation of the functional capacities of individual microbial groups.

+Alternatively, you can start with just the gene table, in which the gene IDs indicate their source genomes (like "G000123456_789"), and perform collapsing on the genome AND the gene, respectively.
+
+```bash
+woltka classify -i indir -c coords.txt -o gene.biom
+woltka collapse -i gene.biom -f 1 -e -m genome-to-species.map -o gene_by_species.biom
+woltka collapse -i gene_by_species.biom -f 2 -m gene-to-pathway.map -o pathway_by_species.biom
+```
+
+In this way, you only need to run the classification workflow once, and do not need to save the bulky `outmap` files to disk.
+
See [here](stratify.md) for details of stratification.
diff --git a/doc/output.md b/doc/output.md
index 59a52d5..bb74431 100644
--- a/doc/output.md
+++ b/doc/output.md
@@ -139,10 +139,10 @@ In `outmap_dir`, there will be three subdirectories: `phylum`, `genus` and `spec

## Table utilities

-Woltka provides several utilities under the `tools` menu for table manipulation (both BIOM and TSV are supported and automatically recognized). They are:
+Woltka provides several utilities for table manipulation (both BIOM and TSV are supported and automatically recognized). 
They are:

-- [**Collapse**](collapse.md): Collapse a profile based on a source-to-target feature mapping; supporting many-to-many relationships.
+- [**Collapse**](collapse.md): Collapse a profile based on a source-to-target feature mapping and/or an internal hierarchy of features; supports many-to-many relationships.
- [**Coverage**](coverage.md): Calculate per-sample coverage of feature groups, such as the completeness of metabolic pathways or core gene sets.
-- [**Normalize**](normalize.md): Normalize a profile based on feature sizes or into relative abundances.
+- [**Normalize**](normalize.md): Normalize a profile based on feature sizes and/or into relative abundances.
- [**Filter**](filter.md): Filter features by per-sample abundance.
- [**Merge**](merge.md): Merge multiple profiles into one.
diff --git a/doc/stratify.md b/doc/stratify.md
index 36e5574..6c8eacf 100644
--- a/doc/stratify.md
+++ b/doc/stratify.md
@@ -4,7 +4,7 @@ It is a frequent need to combine two or more classification systems in one analy

Woltka enables this analysis through a "**stratification**" function. A stratum (plural: strata) is a subset of data grouped under the same label. Woltka allows the user to label sequences using one classification system (e.g., taxonomy), then classify sequences under each label using another classification system (e.g., function). This design maximizes flexibility and modularity. It is not bound by particular classification levels or systems, but can also work with ecological niches, phenotypic features, and other classification methods that best address specific research goals.

-There are **two approaches** to achieve combined structural / functional classification. Both are explained and exemplified below.
+There are **three approaches** to achieve combined structural / functional classification. They are explained and exemplified below.

## Method I: Genome-based analysis

@@ -92,3 +92,38 @@ woltka classify \
--stratify mapdir \
-o taxfunc.biom
```
+
+
+## Method III: Serial collapsing
+
+With this method, you only need to run the time-consuming `woltka classify` workflow once (instead of twice, as in I and II) to generate a gene (ORF) profile; then you can bid farewell to the bulky alignment files and work on just the profile on your laptop. All you need to do then is to run as many `woltka collapse` commands as you like to collapse it to the desired structural and functional levels. This grants you maximum flexibility. It is particularly useful when some SOP (like the [WoL SOP](wol.md) or the [Qiita server](qiita.md)) already generates the ORF profile. The results should be _identical_ to those of method I.
+
+Step 1: Generate a gene (ORF) profile using the [coord-match algorithm](ordinal.md) without any mapping:
+
+```bash
+woltka classify -i align/bowtie2 -c function/coords.txt.xz -o orf.biom
+```
+
+- In the resulting profile, feature IDs are like: "G000006845_1051".
+
+Step 2 to _n_: Collapse the genome (OGU) to the desired taxonomic level:
+
+```bash
+woltka collapse -i orf.biom -e -f 1 -m genome-to-species.map -o species_orf.biom
+woltka collapse -i species_orf.biom -f 1 -m species-to-genus.map -o genus_orf.biom
+woltka collapse -i genus_orf.biom -f 1 -m genus-to-family.map -o family_orf.biom
+...
+```
+
+- Note the first command. The `-e` or `--nested` flag breaks the ORF IDs (like "G000006845_1051") into genomes and genes, and only collapses the genome. You only need it _once_, and on the raw ORF table. 
Then you can stack up multiple `collapse` commands without `-e` to move up in the taxonomic hierarchy.
+
+Step 3 to _m_: Collapse the gene (ORF) to the desired functional level:
+
+```bash
+woltka collapse -i genus_orf.biom -f 2 -m orf-to-ko.map -o genus_ko.biom
+woltka collapse -i genus_ko.biom -f 2 -m ko-to-module.map -o genus_module.biom
+woltka collapse -i genus_module.biom -f 2 -m module-to-pathway.map -o genus_pathway.biom
+...
+```
+
+- This example uses the ORF by genus profile, but you can start with any intermediate profile generated in step 2. You can also mix `collapse` commands from step 2 (`-f 1`) and step 3 (`-f 2`) in any order until you achieve the desired levels.

From e7756c59f4437ad2c4257e32d699f0273e618806 Mon Sep 17 00:00:00 2001
From: qiyunzhu
Date: Sun, 25 Dec 2022 17:09:40 -0700
Subject: [PATCH 2/3] added normalize by coords

---
 woltka/ordinal.py | 54 ++++++++++++++++++++++++++++++++++++
 woltka/tests/test_ordinal.py | 48 +++++++++++++++++++++++++++++++-
 woltka/tests/test_tools.py | 23 ++++++++++++++-
 woltka/tools.py | 18 +++++++++++-
 4 files changed, 140 insertions(+), 3 deletions(-)

diff --git a/woltka/ordinal.py b/woltka/ordinal.py
index f70cc0c..2ca51b9 100644
--- a/woltka/ordinal.py
+++ b/woltka/ordinal.py
@@ -757,3 +757,57 @@ def calc_gene_lens(mapper):
         else:
             res[gid] += code >> 48
     return res
+
+
+def load_gene_lens(fh):
+    """Directly load gene lengths from a coordinates file.
+
+    Parameters
+    ----------
+    fh : file handle
+        Gene coordinates file.
+
+    Returns
+    -------
+    dict of int
+        Gene ID to length mapping.
+
+    See Also
+    --------
+    load_gene_coords
+
+    Notes
+    -----
+    Used by the "normalize" command. This is more efficient than:
+
+    >>> coords, idmap, prefix = load_gene_coords(f, sort=True)
+    >>> mapper = partial(ordinal_mapper, coords=coords, idmap=idmap,
+                         prefix=prefix)
+    >>> sizemap = calc_gene_lens(mapper)
+    """
+    lens, lens_ = {}, None
+    used, isdup = set(), None
+    used_add = used.add
+    for line in fh:
+        c0 = line[0]
+        if c0 in '>#':
+            if line[1] != c0:
+                nucl = line[1:].strip()
+                lens_ = lens[nucl] = []
+        else:
+            x = line.rstrip().split('\t')
+            try:
+                gene, beg, end = x[0], int(x[1]), int(x[2])
+            except (IndexError, ValueError):
+                raise ValueError(
+                    f'Cannot calculate gene length from line: "{line}".')
+            lens_.append((gene, abs(end - beg) + 1))
+            if isdup is None:
+                if gene in used:
+                    isdup = True
+                else:
+                    used_add(gene)
+    if isdup:
+        return {f'{k}_{g}': L for k, v in lens.items() for g, L in v}
+    else:
+        return dict(e for k, v in lens.items() for e in v)
diff --git a/woltka/tests/test_ordinal.py b/woltka/tests/test_ordinal.py
index d8b6ec9..7d20439 100644
--- a/woltka/tests/test_ordinal.py
+++ b/woltka/tests/test_ordinal.py
@@ -24,7 +24,7 @@ from woltka.ordinal import (
    match_read_gene_dummy, match_read_gene, match_read_gene_naive,
    match_read_gene_quart, ordinal_parser_dummy, ordinal_mapper,
-    load_gene_coords, calc_gene_lens)
+    load_gene_coords, calc_gene_lens, load_gene_lens)

 class OrdinalTests(TestCase):
@@ -440,6 +440,52 @@ def test_calc_gene_lens(self):
               'NP_258147.1': 275}
        self.assertDictEqual(obs, exp)

+    def test_load_gene_lens(self):
+        # simple case
+        tbl = ('>n1',
+               'g1\t5\t29',
+               'g2\t33\t61',
+               'g3\t65\t94')
+        obs = load_gene_lens(tbl)
+        exp = {'g1': 25, 'g2': 29, 'g3': 30}
+        self.assertDictEqual(obs, exp)
+
+        # NCBI accession
+        tbl = ('## GCF_000123456\n',
+               '# NC_123456\n',
+               '1\t5\t384\n',
+               '2\t410\t933\n',
+               '# NC_789012\n',
+               '1\t912\t638\n',
+               '2\t529\t75\n')
+        obs = load_gene_lens(tbl)
+        exp = {'NC_123456_1': 380, 'NC_123456_2': 524,
+               'NC_789012_1': 275, 'NC_789012_2': 455}
+        self.assertDictEqual(obs, exp)
+
+        # incorrect formats
+        # only one column
+        msg = 'Cannot calculate gene length from line:'
+        with self.assertRaises(ValueError) as ctx:
+            load_gene_lens(('hello',))
+        self.assertEqual(str(ctx.exception), f'{msg} "hello".')
+        # only two columns
+        with self.assertRaises(ValueError) as ctx:
+            load_gene_lens(('hello\t100',))
+        self.assertEqual(str(ctx.exception), f'{msg} "hello\t100".')
+        # three columns but 3rd is string
+        with self.assertRaises(ValueError) as ctx:
+            load_gene_lens(('hello\t100\tthere',))
+        self.assertEqual(str(ctx.exception), f'{msg} "hello\t100\tthere".')
+
+        # real coords file
+        fp = join(self.datdir, 'function', 'coords.txt.xz')
+        with openzip(fp) as f:
+            obs = load_gene_lens(f)
+        self.assertEqual(obs['G000006845_253'], 996)
+        self.assertEqual(obs['G000006745_1171'], 549)
+        self.assertEqual(obs['G000006925_4349'], 4224)

 if __name__ == '__main__':
     main()
diff --git a/woltka/tests/test_tools.py b/woltka/tests/test_tools.py
index 9ef40b3..c19fa94 100644
--- a/woltka/tests/test_tools.py
+++ b/woltka/tests/test_tools.py
@@ -51,12 +51,33 @@ def test_normalize_wf(self):
        errmsg = '"Hi!" is not a valid scale factor.'
        self.assertEqual(str(ctx.exception), errmsg)

-        # missing sizes
+        # by gene size (from coords)
+        input_fp = join(self.datdir, 'output', 'bowtie2.orf.tsv')
        sizes_fp = join(self.datdir, 'function', 'coords.txt.xz')
+        normalize_wf(input_fp, output_fp, sizes_fp, scale='1k', digits=3)
+        with open(output_fp, 'r') as f:
+            for line in f:
+                gene, _, datum = line.rstrip().partition('\t')
+                if gene == 'G000006925_52':
+                    self.assertEqual(datum, '0.0\t0.0\t0.849\t0.0\t0.0')
+                if gene == 'G000008165_4079':
+                    self.assertEqual(datum, '0.0\t0.0\t1.178\t0.0\t0.0')
+                if gene == 'G000014525_1009':
+                    self.assertEqual(datum, '0.0\t0.0\t0.0\t0.0\t0.887')
+
+        # missing sizes
+        sizes_fp = join(self.datdir, 'tree.nwk')
        with self.assertRaises(SystemExit) as ctx:
            normalize_wf(input_fp, output_fp, sizes_fp)
        errmsg = 'One or more features are not found in the size map.'
        self.assertEqual(str(ctx.exception), errmsg)
+
+        # empty size file
+        open(output_fp, 'w').close()
+        with self.assertRaises(SystemExit) as ctx:
+            normalize_wf(input_fp, output_fp, output_fp)
+        errmsg = 'Size map file is empty or unreadable.'
+        self.assertEqual(str(ctx.exception), errmsg)
        remove(output_fp)

    def test_filter_wf(self):
diff --git a/woltka/tools.py b/woltka/tools.py
index f845f36..967fafb 100644
--- a/woltka/tools.py
+++ b/woltka/tools.py
@@ -14,6 +14,7 @@
from sys import exit
from os import listdir
from os.path import isdir, join, basename
+from itertools import chain
import click

from .table import (
@@ -23,6 +24,7 @@
from .file import readzip, read_map_1st, read_map_many, read_map_all
from .util import scale_factor
from .tree import read_names
+from .ordinal import load_gene_lens

def normalize_wf(input_fp: str,
@@ -57,7 +59,21 @@ def normalize_wf(input_fp: str,
        click.echo(f'Reading feature sizes from file: {basename(sizes_fp)}...',
                   nl=False)
        with readzip(sizes_fp, {}) as f:
-            sizes = {k: float(v) for k, v in read_map_1st(f)}
+
+            # check file format based on first line
+            try:
+                line = next(f)
+            except StopIteration:
+                exit('Size map file is empty or unreadable.')
+            f_ = chain(iter([line]), f)
+
+            # gene coordinates file
+            if line[0] in '>#':
+                sizes = load_gene_lens(f_)
+
+            # regular mapping file
+            else:
+                sizes = {k: float(v) for k, v in read_map_1st(f_)}
        click.echo(' Done.')
        click.echo('Normalizing profile by feature size...', nl=False)
        try:

From 87af16c4c854fbcfc910deb744a97207bac276d4 Mon Sep 17 00:00:00 2001
From: qiyunzhu
Date: Sun, 25 Dec 2022 17:29:47 -0700
Subject: [PATCH 3/3] updated doc

---
 doc/faq.md | 19 +++++++++++++++++--
 doc/normalize.md | 23 ++++++++++++++------
 doc/ordinal.md | 1 +
 3 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/doc/faq.md b/doc/faq.md
index c9cdba2..8420a64 100644
--- a/doc/faq.md
+++ b/doc/faq.md
@@ -89,11 +89,26 @@ Woltka provides multiple normalization features. For example, if you want to get
woltka classify ... --frac
```

-See [here](normalize.md) for details.
+See [here](normalize.md#relative-abundance) for details.

### Can Woltka report cell values in the units of CPM, RPK, and TPM?

-Yes. See [here](normalize.md) for methods.
+Yes. See [here](normalize.md) for methods. For example:
+
+```bash
+woltka classify ... -c coords.txt --sizes . --scale 1k --digits 3 -o rpk.biom
+woltka normalize -i rpk.biom --scale 1M -o tpm.biom
+```
+
+### I already have a gene (ORF) profile of raw counts. Can I still convert it into RPK?
+
+Yes. You can do:
+
+```bash
+woltka normalize -i orf.tsv -z coords.txt -s 1k -d 3 -o orf.rpk.tsv
+```
+
+See [here](normalize.md#abundance-of-functional-genes) for details.

### I ran Woltka separately on multiple subsets of data. Can I merge the results?

diff --git a/doc/normalize.md b/doc/normalize.md
index 841518c..d5284a6 100644
--- a/doc/normalize.md
+++ b/doc/normalize.md
@@ -13,7 +13,8 @@ By default, the cell values in a feature table (profile) are **counts** (**frequ
- [During or after classification](#during-or-after-classification)
- [Normalization by subject size](#normalization-by-subject-size)
  - [Sequence and taxonomic abundances](#sequence-and-taxonomic-abundances)
-  - [Why real-time normalization matters](#Why-real-time-normalization-matters)
+  - [Gene coordinates as size map](#gene-coordinates-as-size-map)
+  - [Why real-time normalization matters](#why-real-time-normalization-matters)
- [Abundance of functional genes](#abundance-of-functional-genes)
- [Relative abundance](#relative-abundance)

@@ -54,7 +55,6 @@ Or post classification (on existing profiles):

woltka normalize --sizes size.map ...
```

-A special case is during "coord-match" functional classification (see [details](ordinal.md)), one can use a dot (`.`) instead of a mapping file. Woltka will read gene sizes from the gene coordinates file.
-
-```bash
-woltka classify -c coords.txt --sizes . ...
-```
-
### Sequence and taxonomic abundances

A common usage of this function is to convert **sequence abundance** to **taxonomic abundance**.
@@ -125,6 +120,20 @@ woltka classify \

The output values are in the unit of **reads per kilobase, or RPK**. They reflect the quantity of functional genes found in the sample.

+Alternatively, if one has already obtained a gene (ORF) profile without normalization:
+
+```bash
+woltka classify -i indir -c coords.txt -o orf.tsv
+```
+
+One can still normalize the profile by feeding the gene coordinates file to the `normalize` command:
+
+```bash
+woltka normalize -i orf.tsv --sizes coords.txt --scale 1k --digits 3 -o orf.rpk.tsv
+```
+
+There are two notes, though. First, this can only be done with the genes (ORFs) but not higher functional units ([explained](#why-real-time-normalization-matters) above). Second, there may be slight differences between some cell values generated using the two methods. This is caused by rounding imprecision when the same read is mapped to multiple genes (ORFs). To avoid this (if you are paranoid about it), you may add `--digits 3` to the `classify` command to make the numbers more precise. Nevertheless, this issue likely won't affect the analysis outcome.
+
### Relative abundance

diff --git a/doc/ordinal.md b/doc/ordinal.md
index 48891ea..a643f89 100644
--- a/doc/ordinal.md
+++ b/doc/ordinal.md
@@ -144,6 +144,7 @@ In the [WoL data release](wol.md), there are pre-built mappings to UniRef, GO, M

- Note: For some databases, such as [MetaCyc](https://metacyc.org), you might encounter an error regarding `AssertionError: Conflicting values found for ...`. This is likely because some classification units were simultaneously assigned to multiple ranks in the database, which causes conflicts in the Woltka workflow. In this case we recommend generating the gene-level profile with `woltka classify`, and then collapsing to individual levels one at a time with `woltka collapse`.

+
## Pathway coverage

It is a frequent goal to assess the abundances of individual metabolic pathways in the microbiome. However, a pathway only makes sense when all (or an essential subset) of its member enzymes are present. Woltka can calculate the percent **coverage** of pathways based on the presence/absence of member genes as follows:
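For reference, a minimal sketch of such a coverage command, assuming a gene profile `gene.biom` and a hypothetical mapping file `pathway-to-genes.map` listing each pathway's member genes (see [coverage calculation](coverage.md) and the `coverage` options in cli.md above for the full usage):

```bash
# percent coverage of each pathway per sample, based on
# which of its member genes are present in the profile
woltka coverage -i gene.biom -m pathway-to-genes.map -o coverage.tsv
```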