Sourmash profiler [DO NOT MERGE YET] #404

Draft
wants to merge 6 commits into
base: bouncy-basenji

Conversation

vmikk
Member

@vmikk vmikk commented Oct 17, 2023

This PR adds sourmash as an additional profiler.
Related to #112

NOTES:

  1. Database
    Sourmash supports several types of databases.
    For taxprofiler, I propose to use a single ZIP file containing signatures, but the database also requires a CSV file with taxonomic information (a gzip-compressed CSV is also supported). So the tar file with the database should contain two files:
test-db-sourmash
├── sourmash-db.zip
└── lineages.csv.gz

For now, the file names are hardcoded.
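
The layout described above could be checked before a run with something like the following. This is only a minimal sketch: the two file names come from the tree above, while `missing_db_files` and `make_tar` are hypothetical helpers written for illustration and are not part of this PR or of sourmash.

```python
import io
import tarfile

# File names taken from the PR description, where they are currently hardcoded.
REQUIRED_FILES = {"sourmash-db.zip", "lineages.csv.gz"}


def missing_db_files(tar_bytes: bytes) -> set[str]:
    """Return the required file names that are absent from the tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        # Compare on base names so the enclosing directory name does not matter.
        present = {
            member.name.rsplit("/", 1)[-1]
            for member in tar.getmembers()
            if member.isfile()
        }
    return REQUIRED_FILES - present


def make_tar(names: list[str]) -> bytes:
    """Build an in-memory tar archive of empty placeholder files (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            info = tarfile.TarInfo(name=name)
            info.size = 0
            tar.addfile(info, io.BytesIO(b""))
    return buf.getvalue()


if __name__ == "__main__":
    complete = make_tar(
        ["test-db-sourmash/sourmash-db.zip", "test-db-sourmash/lineages.csv.gz"]
    )
    print(missing_db_files(complete))
```

An empty result means both expected files are present; anything else names the file that still needs to be added to the archive.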

  2. As a first step, sourmash creates FracMinHash sketches (signatures) for each sample.
    This step is independent of the database, so sketching only needs to be done once per sample. Therefore, I removed the database from the input channel (ch_input_for_profiling.sourmash.map). Otherwise, each sample would be sketched independently for every database provided, producing lots of duplicated sketches. Is that right?

  3. Sourmash can create 4 types of signatures: DNA, protein, protein translated from DNA, and signatures built from a CSV file listing the locations of genomes/proteomes.
    The sourmash/sketch module is written to support all these input types, so extra args must be passed to the process. The easiest way is to specify them in the config, e.g.:

process {
    withName: SOURMASH_SKETCH {
        ext.args = "dna --param-string 'k=31,scaled=1000,noabund'"
    }
}

where the first word in ext.args should be dna, protein, translate, or fromfile.
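
To make the rule about the first word of ext.args concrete, here is a small, hedged sketch of how such a string could be sanity-checked. The four accepted words are the ones listed above; `sketch_subcommand` is a hypothetical helper for illustration and is not part of the module in this PR.

```python
import shlex

# Words accepted as the first token of ext.args, as listed in the PR description.
SKETCH_SUBCOMMANDS = ("dna", "protein", "translate", "fromfile")


def sketch_subcommand(ext_args: str) -> str:
    """Return the leading subcommand of an ext.args string, or raise ValueError."""
    # shlex.split honours shell-style quoting, so the quoted --param-string
    # value stays a single token.
    tokens = shlex.split(ext_args)
    if not tokens or tokens[0] not in SKETCH_SUBCOMMANDS:
        raise ValueError(
            f"ext.args must start with one of {SKETCH_SUBCOMMANDS}, got: {ext_args!r}"
        )
    return tokens[0]


if __name__ == "__main__":
    print(sketch_subcommand("dna --param-string 'k=31,scaled=1000,noabund'"))
```

A check like this would catch a config where the subcommand was forgotten and ext.args begins directly with an option such as --param-string.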

TO DO list

  • upload test dataset to nf-core/test-datasets#taxprofiler branch
  • ...

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/taxprofiler branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@github-actions

github-actions bot commented Oct 20, 2023

nf-core lint overall result: Failed ❌

Posted for pipeline commit 58c4207

+| ✅ 157 tests passed       |+
!| ❗   3 tests had warnings |!
-| ❌   1 tests failed       |-

❌ Test failures:

  • schema_params - Param run_sourmash from nextflow config not found in nextflow_schema.json

❗ Test warnings:

  • nextflow_config - Config manifest.version should end in dev: 1.1.2
  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline

✅ Tests passed:

Run details

  • nf-core/tools version 2.10
  • Run at 2023-10-27 12:24:37

@jfy133 jfy133 marked this pull request as draft October 22, 2023 04:26
@jfy133 jfy133 added the WIP Work in progress label Oct 22, 2023
Labels
enhancement Improvement for existing functionality WIP Work in progress
3 participants