Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phylogenetic placement of ASVs #236

Closed
erikrikarddaniel opened this issue Mar 23, 2021 · 5 comments
Closed

Phylogenetic placement of ASVs #236

erikrikarddaniel opened this issue Mar 23, 2021 · 5 comments
Assignees
Labels
enhancement New feature or request

Comments

@erikrikarddaniel
Copy link
Member

Phylogenetic placement, using Pplacer, is a better way of getting a phylogenetic tree of short amplicons than making a phylogeny from an alignment of the ASVs on their own.

There's a QIIME2 module called q2-fragment-insertion for the purpose, or we can roll our own (see below). The builtin phylogeny in the QIIME2 module is "the 99% OTU Greengenes 13.8 tree with 203,452 tips", but I think one can specify ones own tree. Greengenes is, as you know, outdated.

Rolling our own would allow users to provide an alignment or hmm plus a phylogeny. The builtin would, I suggest, use GTDB species representative genome data. A problem here is that GTDB provides SSU sequences, but not a phylogeny built on them. We would therefore either have to estimate the phylogeny or assume that the protein alignment-based phylogeny is correct and build a reference SSU alignment from the sequences they provide.

Processing outline for GTDB:

  1. Decide which ASVs are archaeal and bacterial respectively. Can be done based on the taxonomic classification I think.
  2. Fetch SSU representative genome SSU sequences for archaea and bacteria respectively.
  3. Fetch the GTDB phylogenies. As far as I can see, only the full phylos are available, so we need to subset to the genomes present in the SSU sequences; this can be done with R.
  4. Align GTDB SSU sequences and ASVs. I suggest we use hmmalign to align to the hmms used e.g. by Barrnap and output only positions that match the profile. *)
  5. Run Pplacer on the subset GTDB phylo and the alignment. *)
  6. Extract a phylogeny and subset to only ASVs

*) Candidates for nf-core modules.

@erikrikarddaniel erikrikarddaniel added the enhancement New feature or request label Mar 23, 2021
@erikrikarddaniel erikrikarddaniel self-assigned this Mar 23, 2021
@erikrikarddaniel
Copy link
Member Author

I've done some testing that I'm summarizing here.

  1. Stage the following files:
    https://github.com/raw/tseemann/barrnap/master/db/arc.hmm
    https://github.com/raw/tseemann/barrnap/master/db/bac.hmm
    https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_all/ssu_all.tar.gz (or the representatives, though quite few archaeal SSU sequences from representative genomes are complete enough).
    https://data.gtdb.ecogenomic.org/releases/latest/ar122.tree.gz
    https://data.gtdb.ecogenomic.org/releases/latest/bac120.tree.gz
  2. Extract the 16S hmm from the barrnap multi-hmms and extract archaeal and bacterial respectively from the ssu_all.fna
hmmfetch arc.hmm 16S_rRNA > arc.16S_rRNA.hmm
hmmfetch bac.hmm 16S_rRNA > bac.16S_rRNA.hmm
grep -A 1 'd__Archaea' ssu_all_r95.fna | grep -v '^--' | sed 's/ .*//' > ar122_full.fna
grep -A 1 'd__Bacteria' ssu_all_r95.fna | grep -v '^--' | sed 's/ .*//' > bac120_full.fna
  1. Extract archaeal and bacterial sequences from sequences.fasta using taxonomy.tsv to arcq.fna and bacq.fna respectively (R)
  2. Align the GTDB sequences and the query sequences to the Barrnap hmms:
hmmalign arc.16S_rRNA.hmm ar122_full.fna  | esl-alimask --rf-is-mask - | esl-alimanip --rffract 0.5 - | esl-reformat -n afa - > ar122_full.alnfna
hmmalign bac.16S_rRNA.hmm bac120_full.fna | esl-alimask --rf-is-mask - | esl-alimanip --rffract 0.5 - | esl-reformat -n afa - > bac120_full.alnfna
hmmalign arc.16S_rRNA.hmm arcq.fna | esl-alimask --rf-is-mask - | esl-reformat -n afa - > arcq.alnfna
hmmalign bac.16S_rRNA.hmm bacq.fna | esl-alimask --rf-is-mask - | esl-reformat -n afa - > bacq.alnfna
  1. Subset trees to the taxa present in alignments
#!/usr/bin/env Rscript

library(Biostrings)
library(ape)

s <- readDNAStringSet('ar122_full.alnfna')
t <- read.tree('ar122.tree.gz')

dl <- t$tip.label[!t$tip.label %in% names(s)]

write.tree(drop.tip(t, dl), 'ar122_full.newick')
# Repeat for bacteria (separate process)
  1. Run EPA-ng to place queries. (Supposedly faster, and I could get it to work.)
epa-ng -t ar122_full.newick -s ar122_full.alnfna -q arcq.alnfna -m GTR+G
# Repeat for bacteria (separate process)
# The above results in a epa_result.jplace file
  1. Gappa can be used to create a tree with both reference and query sequences:
gappa examine graft --jplace-path epa_result.jplace

Gappa can also be used to create Unifrac, aka Kantorovich-Rubinstein (KR), distances and many other things, so the epa_result.jplace should be output to results.

@d4straub
Copy link
Collaborator

Phylogenetic placement is now implemented in nf-core subworkflow fasta_newick_epang_gappa and in https://nf-co.re/phyloplace. Would you like to attempt to integrate it into ampliseq?

@erikrikarddaniel
Copy link
Member Author

Absolutely. There are a couple of things we should discuss regarding the implementation, particularly with regards to reference trees and alignments. I thought we could discuss that when we meet next month.

@d4straub
Copy link
Collaborator

Alright!

@d4straub
Copy link
Collaborator

This issue was broken into #561 & #562, so I close this one here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants