🚀 MR.PARETO - Modules & Recipes for Pragmatic Augmentation of Research Efficiency Towards Optimum

"For many outcomes, roughly 80% of consequences come from 20% of causes (the 'vital few')." - The Pareto Principle by Vilfredo Pareto

Get 80% of all standard (biomedical) data science analyses done semi-automatically with 20% of the effort, by leveraging Snakemake's module functionality to use and combine pre-existing workflows into arbitrarily complex analyses.

⏳ TL;DR - More Time for Science!

"Programming is about trying to make the future less painful. It’s about making things easier for our teammates." from The Pragmatic Programmer by Andy Hunt & Dave Thomas

Why: Time is the most precious resource. By taking care of efficiency (i.e., maximum output with limited resources), scientists can redistribute their time to focus on effectiveness (i.e., the biggest impact possible).
How: Leverage the latest developments in workflow management to (re-)use and combine independent computational modules into arbitrarily complex analyses in combination with modern innovation methods (e.g., fast prototyping, design thinking, and agile concepts).
What: Independent computational Modules implemented as Snakemake workflows, encoding best practices and standard approaches, are used to scale, automate, and parallelize analyses. Snakemake's module functionality enables arbitrarily complex combinations of pre-existing modules for any Project. Recipes combine modules into the most conceivable standard analyses, thereby accelerating projects to the point of the unknown.

Altogether, this enables complex, portable, transparent, reproducible, and documented analysis of biomedical data at scale.

🧩 Modules

"Is it functional, multifunctional, durable, well-fitted, simple, easy to maintain, and thoroughly tested? Does it provide added value, and doesn't cause unnecessary harm? Can it be simpler? Is it an innovation?" - Patagonia Design Principles

Modules are independent, self-contained Snakemake workflows, consisting of Rules for multi-step analyses. A module can be general-purpose (e.g., Unsupervised Analysis) or modality-specific (e.g., ATAC-seq processing). Currently, the following nine modules are available, ordered by their applicability from general to specific:

Module	Type (Data Modality)
Unsupervised Analysis	General Purpose (tabular data)
Split, Filter, Normalize and Integrate Sequencing Data	Bioinformatics (NGS counts)
Differential Analysis with limma	Bioinformatics (NGS data)
Enrichment Analysis	Bioinformatics (genes/genomic regions)
Genome Track Visualization	Bioinformatics (aligned BAM files)
ATAC-seq Processing	Bioinformatics (ATAC-seq)
scRNA-seq Processing using Seurat	Bioinformatics (scRNA-seq)
Differential Analysis using Seurat	Bioinformatics (scRNA-seq)
Perturbation Analysis using Mixscape from Seurat	Bioinformatics (scCRISPR-seq)

Note

⭐️ Star and share modules you find valuable 📤 — help others discover them, and guide our focus for future work!

Tip

For detailed instructions on the installation, configuration, and execution of modules, check out the wiki. Generic instructions are also shown in each module's Snakemake workflow catalog entry.

📋 Projects using multiple Modules

“Absorb what is useful. Discard what is not. Add what is uniquely your own.” - Bruce Lee

Since Snakemake 6, you can (re-)use and combine pre-existing workflows within your projects by loading them as Modules. The combination of multiple modules into projects that analyze multiple datasets represents the overarching vision and power of MR.PARETO.

Note

When applied to multiple datasets within a project, each dataset should have its own result directory within the project directory.

Three components are required to use a module within your Snakemake workflow (i.e., a project).

Configuration: The config/config.yaml file has to point to the respective configuration files per dataset and workflow.

#### Datasets and Workflows to include ###
workflows:
    MyData:
        other_workflow: "config/MyData/MyData_other_workflow_config.yaml"

Snakefile: Within the main Snakefile (workflow/Snakefile) we have to:
- load all configurations;
- include the snakefiles that contain the dataset-specific loaded modules and rules (see next point);
- and add all modules' outputs to the target's rule input.
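The first step, loading all configurations, could be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the repository's actual Snakefile: the function name `load_module_configs`, the `load` callable, and the `result_path` key are hypothetical. It flattens the nested `workflows` mapping into one entry per dataset and workflow, so that a lookup such as `config["MyData_other_workflow"]` resolves to that dataset's configuration.

```python
from typing import Any, Callable, Dict


def load_module_configs(
    workflows: Dict[str, Dict[str, str]],
    load: Callable[[str], Dict[str, Any]],
) -> Dict[str, Dict[str, Any]]:
    """Return {"<dataset>_<workflow>": loaded_config} for every listed entry.

    `load` is a hypothetical callable that turns a config path into a dict.
    """
    loaded: Dict[str, Dict[str, Any]] = {}
    for dataset, modules in workflows.items():
        for workflow, path in modules.items():
            loaded[f"{dataset}_{workflow}"] = load(path)
    return loaded


# Stand-in for reading YAML from disk, so the sketch runs without any files.
# "result_path" is an assumed key, used here for illustration only.
fake_files = {
    "config/MyData/MyData_other_workflow_config.yaml": {"result_path": "results/MyData"},
}

# Same shape as the `workflows` section of config/config.yaml.
workflows = {
    "MyData": {"other_workflow": "config/MyData/MyData_other_workflow_config.yaml"},
}

module_configs = load_module_configs(workflows, fake_files.__getitem__)
print(sorted(module_configs))
```

Inside a Snakefile, `load` could be a small function that opens the path and calls `yaml.safe_load` from PyYAML, and the returned mapping could be merged into the global configuration with `config.update(module_configs)`.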

Modules: Load the required modules and their rules within separate snakefiles (*.smk) in the rule/ folder. Recommendation: Use one snakefile per dataset.

module MyData_other_workflow:
    # here, plain paths, URLs and the special markers for code hosting providers (e.g., GitHub) are possible.
    snakefile: "other_workflow/Snakefile"
    config: config["MyData_other_workflow"]

use rule * from MyData_other_workflow as MyData_other_workflow_*
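The `as MyData_other_workflow_*` clause renames every imported rule by substituting the original rule name for `*`. The following Python sketch only imitates that renaming to show why loading the same workflow for two datasets does not produce clashing rule names. The helper `prefixed_rule_name`, the second dataset `OtherData`, and the rule names `all`, `process`, and `plot` are hypothetical.

```python
from typing import Dict, List


def prefixed_rule_name(pattern: str, rule_name: str) -> str:
    """Apply an `as <prefix>_*` style pattern to one imported rule name."""
    if pattern.count("*") != 1:
        raise ValueError("pattern must contain exactly one '*'")
    return pattern.replace("*", rule_name)


# Hypothetical rule names defined by the loaded workflow.
imported_rules: List[str] = ["all", "process", "plot"]

# Two datasets loading the same workflow as separate modules.
modules = ["MyData_other_workflow", "OtherData_other_workflow"]

renamed: Dict[str, List[str]] = {
    module: [prefixed_rule_name(f"{module}_*", rule) for rule in imported_rules]
    for module in modules
}

for module, names in renamed.items():
    print(module, "->", names)

# No rule name is shared between the two modules.
overlap = set(renamed[modules[0]]) & set(renamed[modules[1]])
print("overlap:", overlap)
```

Assuming the loaded workflow defines a target rule called `all`, the main Snakefile's target rule could then collect `rules.MyData_other_workflow_all.input` for each module.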

Tip

A full tutorial is available on the wiki.

📜 Recipes

"Civilization advances by extending the number of important operations which we can perform without thinking of them." - Alfred North Whitehead, author of Principia Mathematica

Recipes are combinations of existing modules into end-to-end best practice analyses. They can be used as templates for standard analyses by leveraging existing modules, thereby enabling fast iterations and progression into the unknown. Every recipe is described on a wiki page and demonstrated by applying it to a public dataset.

Tip

Process each dataset module by module. Check the results of each module to inform the configuration of the next module. This iterative method allows for quick initial completion, followed by refinement in subsequent iterations based on feedback from yourself or collaborators. Adjustments in later iterations are straightforward, requiring only changes to individual configurations or annotations. Ultimately you end up with a reproducible and readable end-to-end analysis for each dataset.

Recipe	Description	# Modules	Results
ATAC-seq Analysis	From raw BAM files to enrichments of differentially accessible regions.	6(-7)	...
RNA-seq Analysis	From raw BAM files to enrichments of differentially expressed genes.	6(-7)	...
Integrative ATAC-seq & RNA-seq Analysis	From count matrices to epigenetic potential and relative transcriptional abundance.	7(-8)	...
scRNA-seq Analysis	From count matrix to enrichments of differentially expressed genes.	5(-6)	...
scCRISPR-seq Analysis	From count matrix to knockout phenotype enrichments.	6(-7)	...

Note

⭐️ Star this repository and share recipes you find valuable 📤 — help others discover them, and guide our focus for future work!

📚 Resources

MR.PARETO Wiki for instructions & tutorials
GitHub list of MR.PARETO modules
My Data Science Setup - A tutorial for developing Snakemake workflows and beyond
GitHub Page of this repository
Curated and published workflows that could be used as modules:
Software
- Snakemake
- Conda
- Docker
- Singularity
