GTDB Bacteria + Refseq Non-Bacteria database #4

Open
tseemann opened this issue Jun 27, 2024 · 2 comments
tseemann commented Jun 27, 2024

Thanks for writing taxor and including useful databases!

What the community (or maybe just me) really wants is a database that covers more of the microbial kingdom, but with the benefit of GTDB for the bacteria and archaea, along with human and synthetic sequence entries.

My dream database is:

  • Bacteria - GTDB
  • Archaea - GTDB
  • Viruses - RefSeq or GenBank
  • Fungi - GenBank
  • Protozoa - GenBank
  • Human - single or pangenome to catch host DNA
  • Artificial sequences - adaptors, vectors etc

I understand munging GTDB taxonomy with NCBI is a challenge, but do you think this database would be achievable?

Public health labs around the world would be grateful!

JensUweUlrich self-assigned this Jun 28, 2024
JensUweUlrich (Owner) commented

I think this can be done relatively quickly. What Taxor needs is just a directory of FASTA files (one per species) and a metadata file that contains all the taxonomic information. I can create the files for GTDB and GenBank separately and then merge them into a single file. I can also download the publicly available genomes (with accompanying metadata) using genome_updater. I just need more details about the artificial sequences you want to include in the database. If you can provide more information, I will create an index file for you with all the relevant genomes and taxonomy.
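
As a rough illustration of the merge step described above, here is a minimal Python sketch that combines per-source metadata tables into one list of rows and records where each row came from. The column names (`accession`, `taxid`, `species`), the tab-separated layout, and the example accessions are assumptions made for illustration; they are not Taxor's actual metadata format.

```python
import csv
import io


def merge_metadata(sources):
    """Merge tab-separated metadata tables from several sources.

    `sources` maps a source name (e.g. "gtdb") to an iterable of text
    lines with a header row. The column names are hypothetical and do
    not reflect Taxor's real schema.
    """
    merged = []
    seen = set()
    for name, lines in sources.items():
        for row in csv.DictReader(lines, delimiter="\t"):
            if row["accession"] in seen:
                continue  # keep only the first record per accession
            seen.add(row["accession"])
            row["source"] = name  # remember which table the row came from
            merged.append(row)
    return merged


# Made-up example tables; accessions are placeholders, not real assemblies.
gtdb = io.StringIO(
    "accession\ttaxid\tspecies\n"
    "GCF_1\t562\ts__Escherichia coli\n"
)
genbank = io.StringIO(
    "accession\ttaxid\tspecies\n"
    "GCA_9\t5476\tCandida albicans\n"
    "GCF_1\t562\tEscherichia coli\n"
)
merged = merge_metadata({"gtdb": gtdb, "genbank": genbank})
```

Listing GTDB first means its record wins whenever the same accession appears in both tables, which matches the goal of preferring GTDB taxonomy for bacteria and archaea.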

JensUweUlrich (Owner) commented

OK, it turns out it's not as easy as I initially thought. GTDB includes many different organisms with the same NCBI species taxid. This breaks Taxor's internal data model, which relies on a single unique taxid per species, so it requires some refactoring of the code.
I'm also a bit skeptical whether the resolution of k-mer selection schemes like minimizers and syncmers is high enough to distinguish between species that are so close to each other that they have the same species taxid. Do you know how similar those species can be to each other?
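
To get a feel for how often the taxid collision described above occurs, a check along the following lines could be run on the GTDB metadata. This is only a sketch: it assumes the metadata has already been reduced to pairs of GTDB species name and NCBI species taxid, and the function name and the example records are made up for illustration.

```python
from collections import defaultdict


def shared_taxids(records):
    """Find NCBI species taxids that map to more than one GTDB species.

    `records` is an iterable of (gtdb_species, ncbi_species_taxid) pairs.
    Returns {taxid: sorted list of GTDB species names}.
    """
    by_taxid = defaultdict(set)
    for gtdb_species, taxid in records:
        by_taxid[taxid].add(gtdb_species)
    # Keep only taxids that are shared by two or more GTDB species.
    return {t: sorted(names) for t, names in by_taxid.items() if len(names) > 1}


# Made-up records for illustration only, not real GTDB entries.
records = [
    ("s__Example alpha", 1001),
    ("s__Example alpha_A", 1001),
    ("s__Example beta", 1002),
]
conflicts = shared_taxids(records)
```

Each key in `conflicts` is a taxid that would need either a synthetic unique identifier per GTDB species or a change to the one-taxid-per-species assumption.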
