updated doc

qiyunzhu · Dec 26, 2022 · 87af16c
1 parent e7756c5
commit 87af16c
Showing 3 changed files with 34 additions and 9 deletions.
diff --git a/doc/faq.md b/doc/faq.md
@@ -89,11 +89,26 @@ Woltka provides multiple normalization features. For example, if you want to get
 woltka classify ... --frac
 ```
 
-See [here](normalize.md) for details.
+See [here](normalize.md#relative-abundance) for details.
 
 ### Can Woltka report cell values in the units of CPM, RPK, and TPM?
 
-Yes. See [here](normalize.md) for methods.
+Yes. See [here](normalize.md) for methods. For example:
+
+```bash
+woltka classify ... -c coords.txt --sizes . --scale 1k --digits 3 -o rpk.biom
+woltka normalize -i rpk.biom --scale 1M -o tpm.biom
+```
+
+### I already have a gene (ORF) profile of raw counts. Can I still convert it into RPK?
+
+Yes. You can do:
+
+```bash
+woltka normalize -i orf.tsv -z coords.txt -s 1k -d 3 -o orf.rpk.tsv
+```
+
+See [here](normalize.md#abundance-of-functional-genes) for details.
 
 ### I ran Woltka separately on multiple subsets of data. Can I merge the results?
 

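Note on the FAQ commands above: the arithmetic behind RPK and TPM can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Woltka's implementation: the function names `rpk` and `tpm`, the gene IDs, and the counts and lengths are all hypothetical, and gene lengths are assumed to be given in base pairs.

```python
# Illustrative sketch only; not Woltka's code. All names and numbers are hypothetical.

def rpk(counts, lengths_bp):
    """Reads per kilobase: raw count divided by gene length in kilobases."""
    return {gene: count / (lengths_bp[gene] / 1000) for gene, count in counts.items()}

def tpm(rpk_values):
    """Transcripts per million: RPK values rescaled so one sample sums to 1,000,000."""
    total = sum(rpk_values.values())
    return {gene: value / total * 1_000_000 for gene, value in rpk_values.items()}

# One sample with two genes (made-up data).
counts = {"G1": 10, "G2": 30}
lengths_bp = {"G1": 500, "G2": 3000}

rpk_vals = rpk(counts, lengths_bp)  # size normalization, analogous to --sizes with --scale 1k
tpm_vals = tpm(rpk_vals)            # per-sample rescaling, analogous to --scale 1M
print(rpk_vals)
print(tpm_vals)
```

The shorter gene G1 ends up with the higher RPK despite having fewer reads, which is the point of normalizing by gene size before comparing abundances.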
diff --git a/doc/normalize.md b/doc/normalize.md
@@ -13,7 +13,8 @@ By default, the cell values in a feature table (profile) are **counts** (**frequ
 - [During or after classification](#during-or-after-classification)
 - [Normalization by subject size](#normalization-by-subject-size)
   - [Sequence and taxonomic abundances](#sequence-and-taxonomic-abundances)
-  - [Why real-time normalization matters](#Why-real-time-normalization-matters)
+  - [Gene coordinates as size map](#gene-coordinates-as-size-map)
+  - [Why real-time normalization matters](#why-real-time-normalization-matters)
   - [Abundance of functional genes](#abundance-of-functional-genes)
 - [Relative abundance](#relative-abundance)
 
@@ -54,12 +55,6 @@ Or post classification (on existing profiles):
 woltka normalize --sizes size.map ...
 ```
 
-A special case is during "coord-match" functional classification (see [details](ordinal.md)), one can use a dot (`.`) instead of a mapping file. Woltka will read gene sizes from the gene coordinates file.
-
-```bash
-woltka classify -c coords.txt --sizes . ...
-```
-
 ### Sequence and taxonomic abundances
 
 A common usage of this function is to convert **sequence abundance** to **taxonomic abundance**.
@@ -125,6 +120,20 @@ woltka classify \
 
 The output values are in the unit of **reads per kilobase, or RPK**. They reflect the quantity of functional genes found in the sample.
 
+Alternatively, if one has already obtained a gene (ORF) profile without normalization:
+
+```bash
+woltka classify -i indir -c coords.txt -o orf.tsv
+```
+
+One can still normalize the profile by feeding the gene coordinates file to the `normalize` command:
+
+```bash
+woltka normalize -i orf.tsv --sizes coords.txt --scale 1k --digits 3 -o orf.rpk.tsv
+```
+
+There are two caveats. First, this works only for genes (ORFs), not for higher functional units ([explained](#why-real-time-normalization-matters) above). Second, some cell values may differ slightly between the two methods. This is caused by rounding imprecision when the same read is mapped to multiple genes (ORFs). To avoid this (if you are paranoid about it), you may add `--digits 3` to the `classify` command to make the numbers more precise. Nevertheless, this issue is unlikely to affect the analysis outcome.
+
 
 ### Relative abundance
 

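Note on the rounding caveat in the normalize.md change above: a small numeric sketch may help show where the discrepancy comes from. It assumes, for illustration only, that a read mapped to k genes contributes 1/k to each, that un-normalized profiles are rounded to integers by default, and that Python's `round` stands in for whatever rounding Woltka applies. The gene length and counts are made up.

```python
# Illustrative sketch only; the numbers and the rounding behavior are assumptions.

length_kb = 500 / 1000          # hypothetical gene of 500 bp
raw_count = 10 + 1 / 3          # 10 unique reads plus one read shared among 3 genes

# Real-time normalization: divide by size first, round once at the end.
realtime = round(raw_count / length_kb, 3)

# Post hoc normalization of a profile that was rounded to integers when written.
posthoc_int = round(round(raw_count) / length_kb, 3)

# Post hoc normalization of a profile written with three decimal digits.
posthoc_3dp = round(round(raw_count, 3) / length_kb, 3)

print(realtime, posthoc_int, posthoc_3dp)
```

Under these assumptions, keeping three digits in the intermediate profile brings the post hoc value much closer to the real-time one, which matches the advice to add `--digits 3` to the `classify` command.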
diff --git a/doc/ordinal.md b/doc/ordinal.md
@@ -144,6 +144,7 @@ In the [WoL data release](wol.md), there are pre-built mappings to UniRef, GO, M
 
 - Note: For some databases, such as [MetaCyc](https://metacyc.org), you might encounter an error regarding `AssertionError: Conflicting values found for ...`. This is likely because some classification units were simultaneously assigned to multiple ranks in the database, which causes conflicts in the Woltka workflow. In this case we recommend generating the gene-level profile with `woltka classify`, and then collapsing to individual levels one at a time with `woltka collapse`.
 
+
 ## Pathway coverage
 
 It is a frequent goal to assess the abundances of individual metabolic pathways in the microbiome. However, a pathway only makes sense when all (or an essential subset) of its member enzymes are present. Woltka can calculate the percent **coverage** of pathways based on the presence/absence of member genes as follows: