Allow addition of unreleased INRB data #255

jameshadfield · 2024-05-28T04:48:06Z

These changes allow private metadata + sequences to be spiked into the build (after filtering but prior to subsampling). They are intended for use by INRB, using an Excel + FASTA file which they maintain. As such, these changes probably shouldn't be merged into our canonical mpox repo, but perhaps we fork the repo into the INRB organisation and maintain these (small) changes there?

If you have access to the test xlsx + fasta (see this internal slack thread) then you can run from within the phylogenetic directory via:

python scripts/curate_private_data.py --sequences <fasta file> --xlsx <xlsx file> --remap-columns 'accession:accession' 'accession:strain' 'collection date:date' 'province:division' 'health zone:location' --fasta-header-idx 1
nextstrain build . --configfile defaults/clade-i/config.yaml --config private_data=true auspice_config=defaults/clade-i/auspice_config_inrb.json -f results/clade-i/good_metadata_combined.tsv auspice/mpox_clade-I.json
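
For readers without access to the script, here is a minimal sketch of how `old:new` pairs like those passed to `--remap-columns` could be parsed and applied to one spreadsheet row. It is not the code in curate_private_data.py: the function names `parse_remap` and `remap_row`, the choice to drop unmapped columns, and the example row values are all assumptions made for illustration. The one behaviour taken from the command above is that a single source column (`accession`) can feed two output columns (`accession` and `strain`).

```python
from collections import defaultdict


def parse_remap(pairs):
    """Turn ['old:new', ...] into {old: [new, ...]}, keeping argument order."""
    mapping = defaultdict(list)
    for pair in pairs:
        old, sep, new = pair.partition(":")
        if not sep or not old or not new:
            raise ValueError(f"expected 'old:new', got {pair!r}")
        mapping[old].append(new)
    return dict(mapping)


def remap_row(row, mapping):
    """Copy each mapped source column of one row to its new name(s).

    Assumption: columns that are not mentioned in the mapping are dropped.
    """
    out = {}
    for old, new_names in mapping.items():
        if old not in row:
            raise KeyError(f"column {old!r} missing from input row")
        for new in new_names:
            out[new] = row[old]
    return out


mapping = parse_remap([
    "accession:accession", "accession:strain", "collection date:date",
    "province:division", "health zone:location",
])
# Placeholder values, not real INRB data.
row = {"accession": "S1", "collection date": "2024-01-15",
       "province": "province-A", "health zone": "zone-B"}
print(remap_row(row, mapping))
```

Raising on a missing source column, rather than silently skipping it, is one way to surface a renamed spreadsheet header early.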

to do

Metadata values need updating (e.g. they have "Homo-sapiens" we have "Homo sapiens")
Fork into INRB repo?
Test from INRB
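
The first to-do item could be handled with a small lookup table applied during curation. This is only a sketch: the column name `host`, the table `VALUE_FIXES`, and the function `normalize_values` are hypothetical, and only the "Homo-sapiens" to "Homo sapiens" pair comes from the item above.

```python
# Hypothetical lookup table; only the "Homo-sapiens" entry comes from the to-do item.
VALUE_FIXES = {
    "host": {"Homo-sapiens": "Homo sapiens"},
}


def normalize_values(row, fixes=VALUE_FIXES):
    """Return a copy of row with known variant spellings replaced."""
    out = dict(row)
    for column, replacements in fixes.items():
        value = out.get(column)
        if value is not None:
            value = value.strip()
            # Unknown values pass through unchanged.
            out[column] = replacements.get(value, value)
    return out


print(normalize_values({"accession": "S1", "host": "Homo-sapiens"}))
```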

huddlej

This review is just of the test commands and ability of the workflow to produce a tree and not a full code review or phylogenetic review.

The test commands worked for me from an ambient Nextstrain Conda environment after I ran the following commands:

# Needed to install dependencies not in Nextstrain env.
conda install openpyxl

# Needed to make data dir in fresh clone of mpox repo.
mkdir -p data/

After those commands, both the test commands ran without issue and the output tree from the workflow had the private data included as expected. 🎉

I was surprised to see that the output sequences and metadata options of curate_private_data.py have default values rather than being required, but I see that the workflow expects those specific default names. Do you want to allow users to configure those names in the workflow config file? If they change the defaults, the workflow won't work as expected, so you could drop those output options completely and force the defaults.
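
As a rough illustration of the "force the defaults" option, the script's argument parser could expose only the inputs and write to fixed paths. This is a sketch under assumptions: the two output paths are hypothetical placeholders (the names the workflow actually expects are not given in this thread), and only `--sequences` and `--xlsx` are taken from the test command.

```python
import argparse

# Hypothetical paths; the real names the workflow expects are not given in this thread.
OUTPUT_SEQUENCES = "data/private_sequences.fasta"
OUTPUT_METADATA = "data/private_metadata.tsv"


def build_parser():
    parser = argparse.ArgumentParser(description="Curate private data (sketch)")
    parser.add_argument("--sequences", required=True, help="input FASTA file")
    parser.add_argument("--xlsx", required=True, help="input Excel file")
    # No --output-* options: outputs always go to the fixed paths above.
    return parser


args = build_parser().parse_args(["--sequences", "in.fasta", "--xlsx", "in.xlsx"])
print(args.sequences, args.xlsx, "->", OUTPUT_SEQUENCES, OUTPUT_METADATA)
```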

Other minor comments included below.

phylogenetic/rules/prepare_sequences.smk

phylogenetic/scripts/curate_private_data.py

Consists of two main additions:

1. A script to parse private metadata (xlsx format) and sequences and convert them to a format compatible with this workflow. This was designed specifically for the INRB's use case, where Excel is the source of truth. It is intended to be run from outside the snakemake pipeline because of the additional dependencies not available in the managed conda nextstrain runtime, and also to allow users to inspect the output for any warnings.

2. Merging of metadata/sequences from multiple sources via an additional rule in the snakemake workflow (only used when `--config private_data=true`). This rule can be removed once `augur filter` accepts multiple inputs. The data source is one-hot encoded prior to subsampling so that it can be referenced as needed.
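
To make the one-hot encoding of the data source concrete, here is a minimal sketch of merging metadata rows from several named sources. It is not the rule from the workflow: the function `merge_metadata`, the `source_` column prefix, the source names `public` and `private`, and the "later source wins on duplicate IDs" policy are all assumptions for illustration.

```python
def merge_metadata(sources, id_column="accession"):
    """Merge rows from several named sources, one-hot encoding each row's origin.

    sources: {source_name: [row_dict, ...]}.
    Assumption: on duplicate IDs, the row from the later source replaces the earlier one.
    """
    merged = {}
    for name, rows in sources.items():
        for row in rows:
            record = dict(row)
            # One column per source: "1" for the row's own source, "0" otherwise.
            for other in sources:
                record[f"source_{other}"] = "1" if other == name else "0"
            merged[record[id_column]] = record
    return list(merged.values())


rows = merge_metadata({
    "public": [{"accession": "A"}],
    "private": [{"accession": "B"}],
})
for r in rows:
    print(r)
```

With columns like these in the merged metadata, a later subsampling step could select or prioritize rows by source using an ordinary column filter.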

Parameters and description via email correspondence

huddlej reviewed May 28, 2024


jameshadfield force-pushed the james/inrb branch 2 times, most recently from 33292bb to 02e4c21 · May 31, 2024 03:26

jameshadfield added 3 commits June 17, 2024 17:02

Add Clade-I build

73242d1

INRB-specific configs

6b4b47c

Parameters and description via email correspondence

jameshadfield force-pushed the james/inrb branch from ded99fc to 6b4b47c · June 17, 2024 08:17
