Workflow to create a training dataset

Example dataset (with default config, should take 5 minutes)

  1. Download data from NCBI given a list of accessions, or alternatively use your own FASTA files
  2. Define a set of training intervals, e.g. full chromosomes, exons only (requires an annotation), etc.
  3. Shard the dataset for efficient loading with Hugging Face libraries
  4. Optional: upload to the Hugging Face Hub

Requirements:

  • GPN
  • Snakemake
  • If you want to automatically download data from NCBI, install NCBI Datasets (e.g. conda install -c conda-forge ncbi-datasets-cli); a short usage sketch follows this list
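For reference, the NCBI Datasets CLI can be invoked manually as in the sketch below (the workflow runs the equivalent download for each accession listed in config/assemblies.tsv; the accession used here is just an example):

import subprocess

# Example accession (illustrative only); the workflow takes accessions from
# config/assemblies.tsv rather than hard-coding them.
accession = "GCF_000001735.4"

# Download the genome assembly for this accession with the NCBI Datasets CLI
subprocess.run(["datasets", "download", "genome", "accession", accession], check=True)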

Choosing species/assemblies (ignore if using your own set of fasta files):

  • Manually download assembly metadata from NCBI Genome
  • You can choose a set of taxa (e.g. mammals, plants) and apply filters such as annotation level and assembly level.
  • Check out the script gpn/ss/filter_assemblies.py for more details, such as how to subsample or how to keep only one assembly per genus (a rough illustration follows below).
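To give a rough sense of the kind of filtering involved, here is a minimal pandas sketch, not the actual script; the file name and column names are assumptions, so refer to gpn/ss/filter_assemblies.py for the real logic:

import pandas as pd

# Load the assembly metadata table downloaded from NCBI Genome
# (file name and column names are assumptions for illustration).
assemblies = pd.read_csv("assembly_metadata.tsv", sep="\t")

# Keep a single assembly per genus (assuming the organism name starts with the genus)
assemblies["genus"] = assemblies["Organism Name"].str.split().str[0]
assemblies = assemblies.drop_duplicates(subset="genus")

# Subsample to a manageable number of assemblies
assemblies = assemblies.sample(n=min(len(assemblies), 50), random_state=0)

assemblies.to_csv("config/assemblies.tsv", sep="\t", index=False)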

Configuration:

  • See config/config.yaml and config/assemblies.tsv
  • Check the notes in workflow/Snakefile for running with your own set of FASTA files (a quick way to inspect the configuration is sketched below)
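Before running, the configuration can be quickly inspected, for example as follows (a minimal sketch assuming only that the two files above exist and that assemblies.tsv is tab-separated):

import yaml
import pandas as pd

# Print the workflow configuration
with open("config/config.yaml") as f:
    print(yaml.safe_load(f))

# Preview the assemblies the workflow will process
print(pd.read_csv("config/assemblies.tsv", sep="\t").head())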

Running:

  • snakemake --cores all
  • The dataset will be created at results/dataset (a sketch of loading it locally follows below)
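To verify the result, the output can typically be loaded with the Hugging Face datasets library (a sketch; it assumes the shards under results/dataset are in a format the library can auto-detect, such as Parquet):

from datasets import load_dataset

# Load the sharded dataset from the local output directory
dataset = load_dataset("results/dataset")
print(dataset)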

Uploading to Hugging Face Hub:

For easy distribution and deployment, the dataset can be uploaded to the Hugging Face Hub (optionally as a private dataset). It can then be automatically streamed during training, with no need to fully download the data locally. Make sure to first install the Hugging Face Hub client library (huggingface_hub).

from huggingface_hub import HfApi

api = HfApi()

private = False  # set to True to create a private dataset repository
repo_id = "gonzalobenegas/example_dataset"  # replace with your username and dataset name
folder_path = "results/dataset"  # local path of the dataset created by the workflow

# Create the dataset repository on the Hub and upload the folder contents
api.create_repo(repo_id=repo_id, repo_type="dataset", private=private)
api.upload_folder(repo_id=repo_id, folder_path=folder_path, repo_type="dataset")
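After the upload, the dataset can be streamed directly from the Hub during training, for example as follows (a sketch using the repo_id from above; the split name is an assumption, adjust it to your dataset):

from datasets import load_dataset

# Stream examples from the Hub without downloading the full dataset
dataset = load_dataset("gonzalobenegas/example_dataset", split="train", streaming=True)
print(next(iter(dataset)))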