Working with KEGG

KEGG (https://www.genome.jp/kegg/) (Kanehisa et al., 2021) is a classical database of biological functions. It organizes its entries in a well-structured hierarchical system of identifiers, including orthologies (K), modules (M), reactions (R), compounds (C), pathways, diseases and more.

Whereas FTP access to KEGG is limited to users with a subscription, the mapping of UniRef entries to KEGG orthology (KO) entries is freely available from the UniProt data release. Starting from this mapping, we provide a Python script, kegg_query.py, which automatically retrieves higher-level classification information for a given KO list or table from the KEGG server. It relies on the official KEGG REST API, which is freely available to academic users (however, restrictions may apply; see official policy here).
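As a rough illustration of what such a query involves, the sketch below uses the KEGG REST `link` operation to look up the modules linked to a few KOs. It is not taken from kegg_query.py. The helper names `link_url`, `parse_link` and `fetch_links` are hypothetical, and the response is assumed to be tab-delimited text with one source/target pair per line.

```python
from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"  # public KEGG REST endpoint


def link_url(target_db, entries):
    """Build a 'link' request URL, e.g. .../link/module/ko:K00001+ko:K00002."""
    return f"{KEGG_REST}/link/{target_db}/{'+'.join(entries)}"


def parse_link(text):
    """Parse tab-delimited 'source<TAB>target' lines into a dict of lists."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        source, target = line.split("\t")[:2]
        # Drop database prefixes such as "ko:" or "md:" (assumed format).
        source = source.split(":", 1)[-1]
        target = target.split(":", 1)[-1]
        mapping.setdefault(source, []).append(target)
    return mapping


def fetch_links(target_db, entries):
    """Query the server (requires network access) and parse the response."""
    with urlopen(link_url(target_db, entries)) as response:
        return parse_link(response.read().decode("utf-8"))
```

Keeping URL construction and parsing separate from the network call makes the logic easy to check offline. Only `fetch_links` contacts the server.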

Step 1: Classify sequencing data to KO entries, bridged by a UniRef-to-KO mapping file:

woltka classify \
  --input  input_dir \
  --coords coords.txt.xz \
  --map    uniref/uniref.map.xz \
  --map    kegg/ko.map.xz \
  --rank   ko \
  --output ko.tsv

Step 2: Use the script kegg_query.py to build the higher-level hierarchies of the KOs in the profile:

python kegg_query.py ko.tsv

Be patient: the KEGG server accepts at most 10 entries per query (see policy), so a large KO list requires many requests.
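To see why a long KO list takes a while, here is a minimal sketch of splitting identifiers into batches. It assumes the limit of 10 applies to the number of entries per request. The helper `batches` and the KO identifiers are made up for illustration and do not come from kegg_query.py.

```python
def batches(items, size=10):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


kos = [f"K{i:05d}" for i in range(1, 26)]  # 25 made-up KO identifiers
chunks = list(batches(kos))
print([len(chunk) for chunk in chunks])  # prints [10, 10, 5]
```

Under this assumption, a profile with thousands of KOs needs hundreds of requests, each of which waits for a response from the server.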

This will generate multiple mapping files in the current directory. The filenames are self-explanatory. For example, ko-to-reaction.txt maps KOs to reactions (R), ko-to-module.txt maps KOs to modules (M), ko-to-pathway.txt maps KOs to pathways, etc.
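If you want to inspect one of these files yourself, a small reader along the following lines may help. It assumes each line is tab-delimited, with a source identifier followed by one or more targets. Check the actual files before relying on this layout. The function `read_map` and the sample content are hypothetical.

```python
import io


def read_map(handle):
    """Read 'source<TAB>target[<TAB>target...]' lines into a dict of lists."""
    mapping = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        source, targets = fields[0], fields[1:]
        mapping.setdefault(source, []).extend(targets)
    return mapping


# Made-up content standing in for a file such as ko-to-reaction.txt.
example = io.StringIO("K00001\tR00001\tR00002\nK00002\tR00002\n")
ko_to_reaction = read_map(example)
```

With a real file, pass an open file handle, for instance `open("ko-to-reaction.txt")`, instead of the in-memory example.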

Step 3: Use Woltka's collapse command to convert KOs to higher-level classification units. This command supports many-to-many mappings, which is the nature of the relationships between functional units (e.g., genes vs. pathways). For example, the following command will generate a profile of reactions:

woltka collapse -i ko.tsv -m kegg/ko-to-reaction.txt -o reaction.tsv
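The idea of a many-to-many collapse can be illustrated with a toy example. This is a sketch of the concept, not Woltka's implementation. It assumes that every target receives the full count of each source mapped to it. Whether counts should instead be divided among targets is a separate choice that this sketch does not address. All identifiers and counts are made up.

```python
from collections import defaultdict


def collapse(profile, mapping):
    """Sum source counts into their targets, per sample.

    profile: {source: {sample: count}}
    mapping: {source: [target, ...]}
    Assumption: every target receives the full count of the source.
    """
    result = defaultdict(lambda: defaultdict(int))
    for source, counts in profile.items():
        for target in mapping.get(source, []):  # unmapped sources are dropped
            for sample, count in counts.items():
                result[target][sample] += count
    return {target: dict(counts) for target, counts in result.items()}


ko_profile = {"K00001": {"S1": 4, "S2": 0}, "K00002": {"S1": 1, "S2": 3}}
ko_to_reaction = {"K00001": ["R00001", "R00002"], "K00002": ["R00002"]}
reaction_profile = collapse(ko_profile, ko_to_reaction)
```

Here R00002 receives counts from both KOs, while K00001 contributes to two reactions. This is what makes the relationship many-to-many.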

Once a reaction profile is generated, several mapping files help you find more information about the reactions, such as reaction_name.txt, reaction_equation.txt and reaction_definition.txt. Meanwhile, several other one-to-many mapping files, such as reaction-to-compound.txt, reaction-to-module.txt and reaction-to-rclass.txt, help you explore the data at other classification levels. All of these can be used with the collapse command.

Step 4: You may want to explore the completeness of individual reactions, modules and pathways. This can be achieved using Woltka's coverage command. For example:

woltka coverage -i reaction.tsv -m kegg/module-to-reaction.txt -o module.cov.tsv

Unlike the collapse command, the coverage command generates a table in which each cell value indicates the percentage of the reactions required by a module that were found in a sample.
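The percentage can be sketched for a single sample as follows. This toy example is not Woltka's code. It assumes that a reaction counts as "found" when it is listed for the sample, and the function `coverage` and all identifiers are made up.

```python
def coverage(present, module_to_reaction):
    """Return, per module, the percentage of its reactions that are present."""
    result = {}
    for module, reactions in module_to_reaction.items():
        required = set(reactions)
        found = required & set(present)
        result[module] = 100.0 * len(found) / len(required) if required else 0.0
    return result


module_to_reaction = {
    "M00001": ["R00001", "R00002", "R00003", "R00004"],
    "M00002": ["R00002", "R00005"],
}
sample_reactions = {"R00001", "R00002", "R00004"}  # reactions found in one sample
cov = coverage(sample_reactions, module_to_reaction)
```

In this example, M00001 has three of its four reactions present (75%), and M00002 has one of two (50%).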

Step 5: You may also consider stratifying the functional profile by microbiome components. See details.