Phylogenetic placement of ASVs #236

erikrikarddaniel · 2021-03-23T09:48:53Z

Phylogenetic placement, using Pplacer, is a better way of getting a phylogenetic tree of short amplicons than making a phylogeny from an alignment of the ASVs on their own.

There's a QIIME2 module called q2-fragment-insertion for the purpose, or we can roll our own (see below). The builtin phylogeny in the QIIME2 module is "the 99% OTU Greengenes 13.8 tree with 203,452 tips", but I think one can specify ones own tree. Greengenes is, as you know, outdated.

Rolling our own would allow users to provide an alignment or hmm plus a phylogeny. The builtin would, I suggest, use GTDB species representative genome data. A problem here is that GTDB provides SSU sequences, but not a phylogeny built on them. We would therefore either have to estimate the phylogeny or assume that the protein alignment-based phylogeny is correct and build a reference SSU alignment from the sequences they provide.

Processing outline for GTDB:

Decide which ASVs are archaeal and bacterial respectively. Can be done based on the taxonomic classification I think.
Fetch SSU representative genome SSU sequences for archaea and bacteria respectively.
Fetch the GTDB phylogenies. As far as I can see, only the full phylos are available, so we need to subset to the genomes present in the SSU sequences; this can be done with R.
Align GTDB SSU sequences and ASVs. I suggest we use hmmalign to align to the hmms used e.g. by Barrnap and output only positions that match the profile. *)
Run Pplacer on the subset GTDB phylo and the alignment. *)
Extract a phylogeny and subset to only ASVs

*) Candidates for nf-core modules.

The text was updated successfully, but these errors were encountered:

erikrikarddaniel · 2021-03-25T20:26:36Z

I've done some testing that I'm summarizing here.

Stage the following files:
https://github.com/raw/tseemann/barrnap/master/db/arc.hmm
https://github.com/raw/tseemann/barrnap/master/db/bac.hmm
https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_all/ssu_all.tar.gz (or the representatives, though quite few archaeal SSU sequences from representative genomes are complete enough).
https://data.gtdb.ecogenomic.org/releases/latest/ar122.tree.gz
https://data.gtdb.ecogenomic.org/releases/latest/bac120.tree.gz
Extract the 16S hmm from the barrnap multi-hmms and extract archaeal and bacterial respectively from the ssu_all.fna

hmmfetch arc.hmm 16S_rRNA > arc.16S_rRNA.hmm
hmmfetch bac.hmm 16S_rRNA > bac.16S_rRNA.hmm
grep -A 1 'd__Archaea' ssu_all_r95.fna | grep -v '^--' | sed 's/ .*//' > ar122_full.fna
grep -A 1 'd__Bacteria' ssu_all_r95.fna | grep -v '^--' | sed 's/ .*//' > bac120_full.fna

Extract archaeal and bacterial sequences from sequences.fasta using taxonomy.tsv to arcq.fna and bacq.fna respectively (R)
Align the GTDB sequences and the query sequences to the Barrnap hmms:

hmmalign arc.16S_rRNA.hmm ar122_full.fna  | esl-alimask --rf-is-mask - | esl-alimanip --rffract 0.5 - | esl-reformat -n afa - > ar122_full.alnfna
hmmalign bac.16S_rRNA.hmm bac120_full.fna | esl-alimask --rf-is-mask - | esl-alimanip --rffract 0.5 - | esl-reformat -n afa - > bac120_full.alnfna
hmmalign arc.16S_rRNA.hmm arcq.fna | esl-alimask --rf-is-mask - | esl-reformat -n afa - > arcq.alnfna
hmmalign bac.16S_rRNA.hmm bacq.fna | esl-alimask --rf-is-mask - | esl-reformat -n afa - > bacq.alnfna

Subset trees to the taxa present in alignments

#!/usr/bin/env Rscript

library(Biostrings)
library(ape)

s <- readDNAStringSet('ar122_full.alnfna')
t <- read.tree('ar122.tree.gz')

dl <- t$tip.label[!t$tip.label %in% names(s)]

write.tree(drop.tip(t, dl), 'ar122_full.newick')
# Repeat for bacteria (separate process)

Run EPA-ng to place queries. (Supposedly faster, and I could get it to work.)

epa-ng -t ar122_full.newick -s ar122_full.alnfna -q arcq.alnfna -m GTR+G
# Repeat for bacteria (separate process)
# The above results in a epa_result.jplace file

Gappa can be used to create a tree with both reference and query sequences:

gappa examine graft --jplace-path epa_result.jplace

Gappa can also be used to create Unifrac, aka Kantorovich-Rubinstein (KR), distances and many other things, so the epa_result.jplace should be output to results.

d4straub · 2023-02-17T12:08:38Z

Phylogenetic placement is now implemented in nf-core subworkflow fasta_newick_epang_gappa and in https://nf-co.re/phyloplace. Would you like to attempt to integrate it into ampliseq?

erikrikarddaniel · 2023-02-17T12:13:57Z

Absolutely. There are a couple of things we should discuss regarding the implementation, particularly with regards to reference trees and alignments. I thought we could discuss that when we meet next month.

d4straub · 2023-02-17T12:43:25Z

Alright!

d4straub · 2023-03-27T11:43:50Z

This issue was broken into #561 & #562, so I close this one here.

erikrikarddaniel added the enhancement New feature or request label Mar 23, 2021

erikrikarddaniel self-assigned this Mar 23, 2021

d4straub closed this as completed Mar 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phylogenetic placement of ASVs #236

Phylogenetic placement of ASVs #236

erikrikarddaniel commented Mar 23, 2021

erikrikarddaniel commented Mar 25, 2021

d4straub commented Feb 17, 2023

erikrikarddaniel commented Feb 17, 2023

d4straub commented Feb 17, 2023

d4straub commented Mar 27, 2023

Phylogenetic placement of ASVs #236

Phylogenetic placement of ASVs #236

Comments

erikrikarddaniel commented Mar 23, 2021

erikrikarddaniel commented Mar 25, 2021

d4straub commented Feb 17, 2023

erikrikarddaniel commented Feb 17, 2023

d4straub commented Feb 17, 2023

d4straub commented Mar 27, 2023