Skip to content

Classify genomes

Alessio Milanese edited this page Sep 22, 2020 · 6 revisions

You need three inputs:

  1. the genome fasta file(s) (-i or -D)
  2. a STAG database for genomes (-d)
  3. the path where to save the result (-o)

Provide the input genomes

You can provide the input genome fasta with -i, if you want to classify only one genome (example unknown_genome.fa):

stag classify_genome -i unknown_genome.fa [...]

You can give as input all fasta files inside a directory with the -D command:

stag classify_genome -D directory/with/fasta/genomes [...]

STAG database

Second, you need the STAG database (-d). You can either use one that we already compiled (link), or you can create one yourself (link).

Output

Finally, you need to provide a path where to save the result:

stag classify_genome -o result/dir [...]

The directory result/dir contains three directories and one file:

  • file genome_annotation contains the annotation for each genome (one per line)
  • dir genes_predictions contains one file per genome, with the classification of the single marker genes. Note that if a marker gene was not present in the genome, then it will not be present in this file.
  • dir MG_sequences contains all the marker gene extracted from the genomes. There are two files per marker gene, one containing the gene sequences and one containing the protein sequences.
  • dir MG_ali containing the stag alignments of the genes.

NOTE: the genome_annotation is obtained by first concatenating the alignments of the single genes and then classifying this. If the genome (or MAG) is not complete and some marker genes are missing, then the classification will be of poor quality. In this case, it would make sense to check the annotation of the single genes. We will implement a better way to merge this two annotations soon.

Select genes

The genes are predicted with Prdigal and then selected with HMMR (using hmmsearch). We select only the genes that are above a certain threshold.

Since the marker genes are expected to be present in single copy, by default we select only one gene per marker gene (above the threshold). Note that the gene selected is chosen randomly. With the command -r is possible to taxonomically annotate all genes above the threshold. The result will be in the result dir (-o) under genes_predictions.