Mapping-friendly Sequence Reduction function discovery

Introduction

This repository accompanies the paper "Mapping-friendly sequence reductions: going beyond homopolymer compression".
It contains the pipelines, tools and information needed to rerun the analyses performed during the project and to explore the subject further.
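
For readers unfamiliar with the baseline named in the paper title, homopolymer compression collapses every run of identical nucleotides into a single one. The sketch below is illustrative only: the function name is hypothetical and it is not part of the tools shipped in this repository.

```python
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse each run of identical characters into a single character."""
    # groupby yields one (key, run) pair per maximal run of equal characters,
    # so keeping only the keys removes the repeats.
    return "".join(base for base, _run in groupby(seq))


print(homopolymer_compress("AAACCGTTTA"))  # ACGTA
```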

Quick start

This project is designed to run on Linux or macOS. First clone this repository with the --recursive flag so that the tools included as submodules are fetched as well. Then run the init script, which builds the tools that need to be compiled and makes sure the various data files and pre-trained models are in the right place. The --data flag ensures that the reference genomes are downloaded as well. You should then be able to run the pipelines.

git clone --recursive git@github.com:lucblassel/MSR_discovery.git
cd MSR_discovery
./init.sh --data
nextflow run <pipeline-name>

Dependencies

You will need the following dependencies available on your system in order to run the pipelines:

gcc >= 10 to compile Winnowmap on macOS. You can install it with Homebrew: brew install gcc@10
zlib development files to compile minimap2
k8 JavaScript shell to run our fork of paftools.js
Nextflow in order to execute the pipelines
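
Before running init.sh, it can be handy to confirm that the command-line dependencies are reachable. The following is a minimal sketch, not part of this repository: the executable names in REQUIRED are assumptions about how the tools are installed on your system, and the check covers neither the gcc version nor the zlib development files.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED = ["gcc", "k8", "nextflow"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED)
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```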

Pipelines

All the pipelines are implemented in Nextflow. You can run them with: nextflow run <pipeline-name>.
A sample config file nextflow.config is available for you to fill out and adapt to your HPC environment.

SSR evaluation

nextflow run msr_selection.nf -resume

This pipeline executes the following steps:

Generates all SSRs as described in the paper and saves them to the data/SSRs directory.
Evaluates each SSR using simulated reads and saves the resulting PAF mapping file and mapeval output file to the results/SSR_eval/[SSR] directory.
Gathers the mapeval outputs for all evaluated SSRs into a single CSV file, results/SSR_eval/evaluations.csv.
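
If you want to inspect the intermediate mapping files yourself, the 12 mandatory tab-separated columns of the PAF format can be read as sketched below. This is an illustration and not code from this repository: the PafRecord and parse_paf_line names are hypothetical, the example line is made up, optional tags after column 12 are ignored, and the layout of the mapeval output is not covered.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns, in file order."""
    query_name: str
    query_length: int
    query_start: int   # 0-based, inclusive
    query_end: int     # 0-based, exclusive
    strand: str        # "+" or "-"
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int       # number of matching residues
    block_length: int  # alignment block length
    mapq: int          # mapping quality (255 means missing)


def parse_paf_line(line: str) -> PafRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(fields)}")
    q, ql, qs, qe, strand, t, tl, ts, te, m, bl, mq = fields[:12]
    return PafRecord(q, int(ql), int(qs), int(qe), strand,
                     t, int(tl), int(ts), int(te), int(m), int(bl), int(mq))


# Made-up example line, for illustration only.
example = "read1\t1000\t10\t990\t+\tchr1\t50000\t2000\t2980\t950\t985\t60"
record = parse_paf_line(example)
print(record.target_name, record.mapq)  # chr1 60
```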

Using the evaluations.csv file, you can select the top MSRs with the following command:

bin/selectMSRs.py \
  --csv results/SSR_eval/evaluations.csv \
  --top 20 \
  --dir data/SSRs/

This will apply the selection method described in the paper and create a subdirectory called MSRs in the data/SSRs directory, containing symbolic links to the relevant SSR JSON files. It will also create a text file listing the best MSR in each selection category.
If the --dir flag is not specified, this script will create a JSON file in the current directory listing the selected MSRs as well as the best in each category.

MSR in-depth evaluation [TODO]

The previously selected MSRs in data/SSRs/MSRs are evaluated on a wider range of use cases.
For a given reference, a set of reads is simulated with a coverage of approximately 1.5x. The reads and the reference are transformed with each MSR, then the transformed reads are mapped to the transformed reference. The resulting mapping is then evaluated.
This process is done for each possible combination of:

Reference:
- T2T CHM13 v1.1 whole human genome reference
- TandemTools simulated centromeric reference
- Whole Drosophila melanogaster reference
- Whole Escherichia coli reference
Simulator:
- NanoSim with R94 model
- PBSim with P6C4 model
Mapper:
- minimap2
- winnowmap

Each possible mapping is evaluated using the mapeval command in our fork of paftools.js.
Additionally, each mapping is evaluated on the subset of reads that are mapped to repeated regions of the genome. A single .csv file with all the evaluations is produced and can be used to generate plots and tables.

Plotting and Tables

All the necessary result data as well as an R Markdown file are available in the figure_generation directory. With these you should be able to reproduce all the figures and tables used in the article (both in the main text and in the supplementary material).

Included tools

We chose to include all the tools used in this project as git submodules where possible.

Read simulation:
- NanoSim
- PBSim2
Mapping:
- Minimap2
- Winnowmap
Read manipulation:
- fastatools
- lucblassel/rename_sequences
- lucblassel/reduce_sequences
Result file manipulation:
- our custom fork of paftools
- bedtools
- bigBedToBed

The init.sh script sets up the tools before you run the pipelines. It currently only supports amd64 architectures on Linux and macOS. It will build:

Winnowmap on macOS/Linux
PBSim on macOS/Linux
Minimap2 on macOS
Bedtools on macOS

It will download and set up prebuilt binaries for:

Minimap2 on Linux
Bedtools on Linux
reduce_sequences and rename_sequences (Go) on macOS/Linux

If you specify the --data flag, it will also download the reference datasets used in the analysis to the correct directories and pre-process them so that the pipelines can use them.

If you already have binaries for the aforementioned tools, you can place them in the bin directory and the init.sh script will skip them.
