updated doc
qiyunzhu committed Dec 26, 2022
1 parent e7756c5 commit 87af16c
Showing 3 changed files with 34 additions and 9 deletions.
19 changes: 17 additions & 2 deletions doc/faq.md
Expand Up @@ -89,11 +89,26 @@ Woltka provides multiple normalization features. For example, if you want to get
woltka classify ... --frac
```

See [here](normalize.md) for details.
See [here](normalize.md#relative-abundance) for details.

### Can Woltka report cell values in the units of CPM, RPK, and TPM?

Yes. See [here](normalize.md) for methods.
Yes. See [here](normalize.md) for methods. For example:

```bash
woltka classify ... -c coords.txt --sizes . --scale 1k --digits 3 -o rpk.biom
woltka normalize -i rpk.biom --scale 1M -o tpm.biom
```
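The second command above rescales each sample so that its values sum to one million. As a rough illustration of that arithmetic (a minimal sketch of the usual TPM definition, not Woltka's own code; the function name `rpk_to_tpm` and the gene values are hypothetical):

```python
def rpk_to_tpm(rpk):
    """Scale the RPK values of one sample so that they sum to one million."""
    total = sum(rpk.values())
    if total == 0:
        # An empty sample stays all-zero instead of dividing by zero.
        return {gene: 0.0 for gene in rpk}
    return {gene: value / total * 1_000_000 for gene, value in rpk.items()}


# Hypothetical RPK values of three genes in one sample.
sample_rpk = {"G1": 2.0, "G2": 6.0, "G3": 8.0}
sample_tpm = rpk_to_tpm(sample_rpk)
print(sample_tpm)
```

Because every value is divided by the sample total, TPM values are comparable across samples of different sequencing depth, while the ratios between genes within a sample stay the same as in the RPK profile.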

### I already have a gene (ORF) profile of raw counts. Can I still convert it into RPK?

Yes. You can do:

```bash
woltka normalize -i orf.tsv -z coords.txt -s 1k -d 3 -o orf.rpk.tsv
```

See [here](normalize.md#abundance-of-functional-genes) for details.

### I ran Woltka separately on multiple subsets of data. Can I merge the results?

23 changes: 16 additions & 7 deletions doc/normalize.md
Expand Up @@ -13,7 +13,8 @@ By default, the cell values in a feature table (profile) are **counts** (**frequ
- [During or after classification](#during-or-after-classification)
- [Normalization by subject size](#normalization-by-subject-size)
- [Sequence and taxonomic abundances](#sequence-and-taxonomic-abundances)
- [Why real-time normalization matters](#Why-real-time-normalization-matters)
- [Gene coordinates as size map](#gene-coordinates-as-size-map)
- [Why real-time normalization matters](#why-real-time-normalization-matters)
- [Abundance of functional genes](#abundance-of-functional-genes)
- [Relative abundance](#relative-abundance)

Expand Down Expand Up @@ -54,12 +55,6 @@ Or post classification (on existing profiles):
woltka normalize --sizes size.map ...
```

A special case is during "coord-match" functional classification (see [details](ordinal.md)), one can use a dot (`.`) instead of a mapping file. Woltka will read gene sizes from the gene coordinates file.

```bash
woltka classify -c coords.txt --sizes . ...
```

### Sequence and taxonomic abundances

A common usage of this function is to convert **sequence abundance** to **taxonomic abundance**.
Expand Down Expand Up @@ -125,6 +120,20 @@ woltka classify \

The output values are in the unit of **reads per kilobase, or RPK**. They reflect the quantity of functional genes found in the sample.

Alternatively, if one already obtained a gene (ORF) profile without normalization:

```bash
woltka classify -i indir -c coords.txt -o orf.tsv
```

One can still normalize the profile by feeding the gene coordinates file to the `normalize` command:

```bash
woltka normalize -i orf.tsv --sizes coords.txt --scale 1k --digits 3 -o orf.rpk.tsv
```

Two caveats apply. First, this works only for genes (ORFs), not for higher functional units ([explained](#why-real-time-normalization-matters) above). Second, some cell values may differ slightly between the two methods. This is caused by rounding imprecision when the same read is mapped to multiple genes (ORFs). To avoid this (if you are paranoid about it), you may add `--digits 3` to the `classify` command to make the numbers more precise. Nevertheless, this issue likely won't affect the analysis outcome.
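The rounding effect can be illustrated with a small Python sketch. It is not Woltka's implementation. It assumes that a gene's size is `abs(end - start) + 1`, that a read mapped to three genes contributes 1/3 to each, and that a raw profile written without `--digits` holds integers. All gene IDs, coordinates, and counts are hypothetical.

```python
def gene_length(start, end):
    """Length in bp of a gene given 1-based inclusive coordinates."""
    return abs(end - start) + 1


def to_rpk(count, length_bp, digits=3):
    """Reads per kilobase: count divided by gene length in kb."""
    return round(count / length_bp * 1000, digits)


# Hypothetical coordinates (start, end); G2 lies on the reverse strand.
coords = {"G1": (1, 1200), "G2": (3000, 2101)}
lengths = {g: gene_length(s, e) for g, (s, e) in coords.items()}

# Hypothetical counts: one read hits three genes and adds 1/3 to each.
exact = {"G1": 1 + 1 / 3, "G2": 4 + 1 / 3}

# (a) Normalize the unrounded counts (analogous to normalizing during classify).
realtime = {g: to_rpk(c, lengths[g]) for g, c in exact.items()}

# (b) Round counts to integers first, then normalize (assumed default output).
posthoc_int = {g: to_rpk(round(c), lengths[g]) for g, c in exact.items()}

# (c) Keep 3 digits in the raw profile first, then normalize.
posthoc_3 = {g: to_rpk(round(c, 3), lengths[g]) for g, c in exact.items()}

for name, result in [("realtime", realtime),
                     ("posthoc_int", posthoc_int),
                     ("posthoc_3", posthoc_3)]:
    print(name, result)
```

With these toy numbers, (c) differs from (a) only in the third decimal place of G2, whereas (b) deviates more because the counts are very small. With realistic read counts the gap in (b) would be proportionally much smaller.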


### Relative abundance

1 change: 1 addition & 0 deletions doc/ordinal.md
Expand Up @@ -144,6 +144,7 @@ In the [WoL data release](wol.md), there are pre-built mappings to UniRef, GO, M

- Note: For some databases, such as [MetaCyc](https://metacyc.org), you might encounter an error regarding `AssertionError: Conflicting values found for ...`. This is likely because some classification units were simultaneously assigned to multiple ranks in the database, which causes conflicts in the Woltka workflow. In this case we recommend generating the gene-level profile with `woltka classify`, and then collapsing to individual levels one at a time with `woltka collapse`.


## Pathway coverage

It is a frequent goal to assess the abundances of individual metabolic pathways in the microbiome. However, a pathway only makes sense when all (or an essential subset) of its member enzymes are present. Woltka can calculate the percent **coverage** of pathways based on the presence/absence of member genes as follows:
