sciRNAseq3 pipeline

A Snakemake pipeline for processing single-cell combinatorial indexing RNA sequencing (sci-RNA-seq) data.

sci-RNA-seq is described here:

Martin, B.K., Qiu, C., Nichols, E. et al. Optimized single-nucleus transcriptional profiling by combinatorial indexing. Nat Protoc 18, 188–207 (2023). https://doi.org/10.1038/s41596-022-00752-0

The pipeline presented here was developed independently.

Installation

On rackham (UPPMAX)

On the Swedish UPPMAX cluster rackham, Conda is preinstalled with Bioconda pre-configured, so these commands suffice:

module load conda
conda env create -n scirnaseq -f environment.yml
conda activate scirnaseq

The first and last step need to be repeated when starting a new shell (for example, after logging in again).

Everywhere else

Install Conda
Configure the Bioconda channel

Create the Conda environment:

conda env create -n scirnaseq -f environment.yml

Activate the environment:
```
conda activate scirnaseq
```

The last step of activating the environment needs to be repeated when starting a new shell (for example, after logging in again).

Before running the pipeline

The reference and annotations need to be prepared. This only needs to be done once. These commands work for GRCm39:

mkdir -p ref/GRCm39
cd ref/GRCm39
wget https://ftp.ensembl.org/pub/release-110/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
wget https://ftp.ensembl.org/pub/release-110/gtf/mus_musculus/Mus_musculus.GRCm39.110.gtf.gz
gunzip Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
gunzip Mus_musculus.GRCm39.110.gtf.gz
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir star --genomeFastaFiles Mus_musculus.GRCm39.dna.primary_assembly.fa --sjdbGTFfile Mus_musculus.GRCm39.110.gtf
cd ../..

Running the pipeline

Prepare the pipeline run directory

Create a new directory somewhere and change into it:
```
mkdir run01
cd run01
```
Make the compressed FASTQ files available in a raw-reads/ subdirectory. The files must have the extension .fastq.gz and contain the text _R1_ in their name. We recommended using symbolic links.
Make the indexed reference available, for example using a symbolic link:
```
ln -s ../ref ref
```
Create these files:
- ligation-indices-with-linker.fasta
- ligation-indices.fasta
- p7-indices.fasta
- rt-indices.fasta
Create a samplesheet in TSV format named samples.tsv: It needs to have these columns:
- Barcode: The 10-nt RT index
- Sample: Sample name
- Replicate: Replicate name (for example, use numbers 1, 2, 3, ...)

Running Snakemake

Do a dry-run to test whether everything is set up correctly:

snakemake --dry-run -p -s path/to/Snakefile

Replace path/to/Snakefile appropriately (for example, ../Snakefile). If there are no error messages, run the pipeline:

snakemake -p --cores=all -s path/to/Snakefile

Pipeline result files

All result files are placed into a newly created out/ directory. These are the relevant ones:

report.html: QC report
p7-mismatches.tsv: Statistics with P7 index mismatches. Also shown in report.html.
rds/: Directory with Seurat objects (.rds files), one for each sample and one named allsamples.rds containing all samples.
star/Solo.out/Gene/filtered/: Directory with Seurat-compatible counts matrix with overall counts. This is not split by sample and uses default STAR solo filtering.
filtered/: Count matrix using adjusted STAR solo filtering that should result in more cells.
samples.pdf: Various plots:
- Number of cells per sample and replicate
- Proportion of replicate per sample
- nFeature_RNA
- nCount_RNA
- Top expressed genes

Read structure

See the Nature Protocols paper (https://doi.org/10.1038/s41596-022-00752-0) and also https://teichlab.github.io/scg_lib_structs/methods_html/sci-RNA-seq3.html.

Structure of R1

#{9,10}CAGAGCN{8}#{10}

# is part of an index, N is part of UMI

9-10 nt ligation index (some have 9 nt, some have 10 nt)
6 nt linker
8 nt UMI
10 nt RT index

Total: 33 or 34 nt. Sequenced reads have 34 nt.

Pipeline overview

Input data is assumed to already have been demultiplexed by P7 index (I1)
Cutadapt is run twice to move ligation index, UMI and RT index from R2 into the read header. R2 is empty afterwards and discarded.
Script simulatecb.py simulates R2 reads similar to 10X Chromium reads. The sequence consists of:
- RT index (10 nt)
- Ligation index (10 nt, those that have only 9 nt have an A added to them)
- P7 index (10 nt)
- UMI
STAR solo is run. It is configured to interpret the first 30 nt as cell barcode.

Links

The original pipeline published along with the Nature Protocols paper: https://github.com/JunyueC/sci-RNA-seq3_pipeline

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
allowedbarcodes.py		allowedbarcodes.py
environment.yml		environment.yml
filtercells.py		filtercells.py
report.qmd		report.qmd
sample_split.R		sample_split.R
saturation.py		saturation.py
simulatecb.py		simulatecb.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sciRNAseq3 pipeline

Installation

On rackham (UPPMAX)

Everywhere else

Before running the pipeline

Running the pipeline

Prepare the pipeline run directory

Running Snakemake

Pipeline result files

Read structure

Structure of R1

Other reads

Pipeline overview

Links

About

Releases

Packages

Contributors 2

Languages

License

NBISweden/scirnaseq

Folders and files

Latest commit

History

Repository files navigation

sciRNAseq3 pipeline

Installation

On rackham (UPPMAX)

Everywhere else

Before running the pipeline

Running the pipeline

Prepare the pipeline run directory

Running Snakemake

Pipeline result files

Read structure

Structure of R1

Other reads

Pipeline overview

Links

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages