Below is the tutorial illustrating the way of processing of bacterial RNA-seq experiments in --simple
and --multi
modes.
If you have RNA-seq experiments done in 1 strain, or strains that differ by a mutation/gene deletion/plasmid, use --simple
with the reference that includes all elements of interest (e.g. if strains differ by a plasmid, make sure the plasmid is included in your reference sequence). If your strains differ substantially (e.g. multiple prophages and plasmids), use --multi
mode. Both workflows require at least one well-annotated reference strain, and work best with complete genome assemblies (i.e. full assembled chromosomes, and not contigs).
Before starting this tutorial, install bacpipe
and all dependencies as described here.
In order to illustrate the processing of RNA-seq using --simple
workflow, we will use the dataset GSE50184, studying the influence of misR regulon in Neisseria gonorrhoeae. The dataset consists of 6 samples - 3 replicates each of the wildtype and misR knockout.
First of all, let's choose a directory with plenty of space and enter it.
mkdir Ngo_tutorial
cd Ngo_tutorial
After this, download the fastq files using fastq-dump
tool from NCBI sra-tools
from NCBI (If you get an error about unknown option, replace --split-3
with --split-e
in the commands below):
for i in `seq 50 55`
do
fastq-dump --split-3 SRR9570$i &
done
wait
After all of the downloading is done, rename and gzip the fastq files:
gzip -c SRR957050.fastq > WT_rep1.fastq.gz &
gzip -c SRR957051.fastq > WT_rep2.fastq.gz &
gzip -c SRR957052.fastq > WT_rep3.fastq.gz &
gzip -c SRR957053.fastq > misR_KO_rep1.fastq.gz &
gzip -c SRR957054.fastq > misR_KO_rep2.fastq.gz &
gzip -c SRR957055.fastq > misR_KO_rep3.fastq.gz &
wait
rm SRR*fastq
While read download and processing is happening, let's download and rename the reference genome and its annotation. We will use strain FA19 GenBank assembly and annotation.
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/047/225/GCA_001047225.1_ASM104722v1/GCA_001047225.1_ASM104722v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/047/225/GCA_001047225.1_ASM104722v1/GCA_001047225.1_ASM104722v1_genomic.gff.gz
gzip -d GCA*gz
mv GCA_001047225.1_ASM104722v1_genomic.fna FA19.fa
mv GCA_001047225.1_ASM104722v1_genomic.gff FA19.gff
Additionally, let's make a prophage interval file. As a crude estimate, we can use PHASTER results, that predict the presence of 4 prophage-like elements in FA19:
CP012026.1 191612 239575 Ph1
CP012026.1 451221 474055 Ph2
CP012026.1 851813 880939 Ph3
CP012026.1 1367842 1402902 Ph4
Simply use any text editor, copy the table above, and save it as a text file named FA19.prophage.bed. Alternatively, download the file from here.
After all the files are downloaded and renamed, set up the directory structure and move files to appropriate sub-directories:
mkdir fastqs study_strains
mv *.fastq.gz fastqs
mv FA19.* study_strains
We should be all set to go. Let's run the reference preparation in --simple
mode. Make sure you are in your main working directory - Ngo_tutorial:
prepare_bacterial_reference --simple . FA19 -p 8
Your logs should look something like this:
==> Initiating bacpipe reference preparation for SIMPLE SINGLE-STRAIN workflow!
==> Following variables were set:
WDIR: .
TAG: FA19
CPUS: 8
==> Parallel jobs will be ran on 8 cores.
Directory /pub37/alexp/data/rnaseq/for_bacpipe_paper/ngon_misR_GSE50184/study_strains/FA19 was not found and will be created.
==> Running Prokka annotation to identify tRNA and rRNA loci
FA19.gff annotation file processed; found 0 gene entries without a locus tag, for which new locus tags were generated.
Parsed GFF stats: 2220 protein coding, 0 pseudogenes, 1 noncoding RNAs, 56 tRNAs, 12 rRNAs, 0 others.
==> Files FA19.genome.fa and FA19.gene.gff successfully generated
==> STAR aligner index FA19.STAR successfully generated
==> rRNA/tRNA operon interval file FA19.rRNA.bed successfully created
All the generated files and indexes have been moved to /pub37/alexp/data/rnaseq/for_bacpipe_paper/ngon_misR_GSE50184/study_strains/FA19.
Strain FA19: all done generating reference!
==> Checking study strain FA19:
- all necessary annotation files and directories for FA19 exist and are not empty.
==> Writing simple config file, simple.cfg
==> DONE generating simple single-strain reference!
Note the statistics of the GFF file (2220 protein coding, 0 pseudogenes, 1 noncoding RNAs, 56 tRNAs, 12 rRNAs, 0 others). If this is different from what you expect for your strain, examine the GFF file for the reason of such discrepancies.
You will find that inside study_strains subdirectory, there is now a directory named FA19 that contains all the neccessary reference files (STAR index, Prokka annotation, rRNA intervals, etc). You will also find simple.cfg config file written in your main working directory.
We're all ready to go now! Go to Ngo_tutorial and run bacpipe like this (adjust the number of cores used according to what's available on your system):
bacpipe . simple.cfg -p 64 | tee bacpipe.log
This will write the bacpipe.log file while printing logs to the screen at the same time. If you don't want to monitor the progress real-time, you can simply replace the command with
bacpipe . simple.cfg -p 64 &> bacpipe.log
After this, you should obtain several folders with the processed RNA-seq data.
- FastQC contains the output of fastqc intial read quality control;
- bams contains rRNA/tRNA-filtered indexed BAM files used for visualization, as well as STAR alignment logs;
- tdfs_and_bws contain TDF files (to be visualized in IGV), and straind-specific, normalized bigWig files (can be visualized in almost any genome browser);
- strand and featureCounts contain the results of
featureCounts
quantification. The former contains tables for all possible strand-specificities, and the latter contains only the quantificaion for the correct strand; - exp_tables contains annotated and non-annotated expression tables in the form of raw counts and TPMs;
- stats contains statistics calculated per individual sample. A summary of these statistics is given in the final log.
Strand specificity is reported in the log:
Average percent of reads matching the coding strand: 50.5977, lowest: 49.963, highest: 51.220
This particular RNA-seq experiment was not done in a strand-specific manner.
Read number statistics are reported above:
Sample | Strain | All | rRNA | unmapped | multimap | 1_loc | assigned |
---|---|---|---|---|---|---|---|
misR_KO_rep1 | FA19 | 18692365 | 15551153 | 301948 | 117585 | 2721679 | 2740861 |
misR_KO_rep2 | FA19 | 7644453 | 5448011 | 285936 | 76984 | 1833522 | 1839844 |
misR_KO_rep3 | FA19 | 9183811 | 6997228 | 196893 | 78893 | 1910797 | 1930790 |
WT_rep1 | FA19 | 8293165 | 4336837 | 605150 | 141683 | 3209495 | 3316358 |
WT_rep2 | FA19 | 7730590 | 4930697 | 119528 | 119860 | 2560505 | 2584079 |
WT_rep3 | FA19 | 11372201 | 5711041 | 1346176 | 206595 | 4108389 | 4211370 |
This shows us the raw read numbers (read pairs for paired-end experiments). In this case we see that the experiments were not fully rRNA-depleted, with 8-18 M initial reads and 2-4 M reads assigned to annotated genes. Percentages of the initial reads are given below:
Sample | Strain | rRNA | unmapped | multimap | 1_loc | assigned |
---|---|---|---|---|---|---|
misR_KO_rep1 | FA19 | 83.195 | 1.615 | 0.629 | 14.560 | 14.663 |
misR_KO_rep2 | FA19 | 71.268 | 3.740 | 1.007 | 23.985 | 24.068 |
misR_KO_rep3 | FA19 | 76.191 | 2.144 | 0.859 | 20.806 | 21.024 |
WT_rep1 | FA19 | 52.294 | 7.297 | 1.708 | 38.700 | 39.989 |
WT_rep2 | FA19 | 63.782 | 1.546 | 1.550 | 33.122 | 33.427 |
WT_rep3 | FA19 | 50.219 | 11.837 | 1.817 | 36.127 | 37.032 |
The percentage of reads that don't map to rRNA/tRNA (multimap + 1_loc) is pretty close to "assigned" - meaning that the annotation does not miss any strongly expressed features. A larger difference would indicated that the annotation is missing some strongly expressed features (most commonly, ncRNA).
After you've successfully completed the processing, you can use the obtained expression tables to get PCA, clustering, and differential expression analysis. A quick and interactive processing can be done in Phantasus. Taking annotated table of counts (FA19.annotated.counts.tsv), we can upload the table to the web tool, normalize it (Adjust -> log2-transform -> quantile normalize), and run PCA analysis (Tools -> Plots -> PCA plot). This should give us the following plot, indicating clear separation of mutants from WT samples:
Quick differential expression using limma (Tools -> Differential expression -> limma) generates the following picture, identifying genes up- and down-regulated in misR mutants:
Overall, bacpipe
streamlines the processing of simple, single-strain RNA-seq experiments.
Processing differs significantly for multiple strains. In order to demonstrate how it works, we will use a small part of GSE119724 dataset, in which expression in Salmonella Typhimurium strains 4/74 and D23580 (often shortened to D23) was compared under various in vitro conditions and inside macrophages. For simplicity, we will only use 6 samples of bacteria grown under in vitro conditions that induce SPI-2 gene expression (low pH, low phosphorus) - InSPI2.
First of all, let's choose a directory with plenty (~ 50 Gb) of space and enter it.
mkdir Stm_tutorial
cd Stm_tutorial
After this, download the fastq files using fastq-dump
tool from NCBI sra-tools
from NCBI (If you get an error about unknown option, replace --split-3
with --split-e
in the commands below):
for i in `seq 12 14` `seq 45 47`
do
fastq-dump --split-3 SRR78141$i &
done
wait
After all of the downloading is done, let's give the files more informative names and compress them, deleting the original SRR files:
gzip -c SRR7814112.fastq > D23_InSPI2_rep1.fastq.gz &
gzip -c SRR7814113.fastq > D23_InSPI2_rep2.fastq.gz &
gzip -c SRR7814114.fastq > D23_InSPI2_rep3.fastq.gz &
gzip -c SRR7814145.fastq > 474_InSPI2_rep1.fastq.gz &
gzip -c SRR7814146.fastq > 474_InSPI2_rep2.fastq.gz &
gzip -c SRR7814147.fastq > 474_InSPI2_rep3.fastq.gz &
wait
rm SRR*fastq
After this, let's download genome assemblies and GFF annotations (GenBank versions) for reference strains. We will use old D23580, LT2, and SL1344 (in order to be used as a reference strain, it needs to have a genomic GFF3 file available). For study strains, we will use the new assembly of D23580, and 4/74. It's also OK to use the same strain as a study and reference strain, but it's best to give them different strain tags. Here we will add "_ref" to reference strain tags:
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/027/025/GCA_000027025.1_ASM2702v1/GCA_000027025.1_ASM2702v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/027/025/GCA_000027025.1_ASM2702v1/GCA_000027025.1_ASM2702v1_genomic.gff.gz
gzip -d GCA*gz
mv GCA_000027025.1_ASM2702v1_genomic.fna D23_ref.fa
mv GCA_000027025.1_ASM2702v1_genomic.gff D23_ref.gff
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/210/855/GCA_000210855.2_ASM21085v2/GCA_000210855.2_ASM21085v2_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/210/855/GCA_000210855.2_ASM21085v2/GCA_000210855.2_ASM21085v2_genomic.gff.gz
gzip -d GCA*gz
mv GCA_000210855.2_ASM21085v2_genomic.fna SL1344_ref.fa
mv GCA_000210855.2_ASM21085v2_genomic.gff SL1344_ref.gff
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/006/945/GCA_000006945.2_ASM694v2/GCA_000006945.2_ASM694v2_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/006/945/GCA_000006945.2_ASM694v2/GCA_000006945.2_ASM694v2_genomic.gff.gz
gzip -d GCA*gz
mv GCA_000006945.2_ASM694v2_genomic.fna LT2_ref.fa
mv GCA_000006945.2_ASM694v2_genomic.gff LT2_ref.gff
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/188/735/GCA_000188735.1_ASM18873v1/GCA_000188735.1_ASM18873v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/538/085/GCA_900538085.1_D23580_liv/GCA_900538085.1_D23580_liv_genomic.fna.gz
gzip -d GCA*gz
mv GCA_000188735.1_ASM18873v1_genomic.fna 474.fa
mv GCA_900538085.1_D23580_liv_genomic.fna D23.fa
Also download pre-made prophage BED files for D23 and 474. These are the intervals generated by curated prophage annotation; a cruder, much faster estimate can be generated using Phaster online server.
Finally, let's download a file of Salmonella-specific ncRNAs, small ORFs, and pseudogenes/selenocysteine genes - St_ncNRA_sORF.fa.
After all the necessary files have been downloaded, renamed, and compressed, let's place them in the appropriate folders:
mkdir study_strains ref_strains fastqs
mv *fastq.gz fastqs
mv *_ref.fa *ref.gff ref_strains
mv 474.fa D23.fa *.bed study_strains
The extra reference file St_ncNRA_sORF.fa we've downloaded previously should remain in the main working directory.
Now, let's make a config file. Using your favourite text editor, make a text file named mult.cfg and put the following (tab-separated) lines in it:
D23_InSPI2_rep1 D23
D23_InSPI2_rep2 D23
D23_InSPI2_rep3 D23
474_InSPI2_rep1 474
474_InSPI2_rep2 474
474_InSPI2_rep3 474
Reference LT2_ref
Reference D23_ref
Reference SL1344_ref
We should now be ready to make all the necessary reference files. Let's run the reference preparation in --multi
mode. Make sure you are in your main working directory - Stm_tutorial:
prepare_bacterial_reference --multi . mult.cfg -p 16 -r St_ncNRA_sORF.fa
The logs will indicate many performed operations and quality controls. Most important things to note are the following messages:
====> Merging blast-based annotation with Prokka-predicted features for strain 474:
...
==> Found 4867 protein-coding (CDS) and 280 non-coding RNA (misc_RNA/ncRNA) features.
====> Merging blast-based annotation with Prokka-predicted features for strain D23:
...
==> Found 4897 protein-coding (CDS) and 287 non-coding RNA (misc_RNA/ncRNA) features.
...
==> Directory /pub37/alexp/data/rnaseq/for_bacpipe_paper/jay_tut/ref_strains/D23_ref was not found and will be created.
...
Output GFF: 4446 CDS, 75 pseudogenes (as CDS), 0 tRNA, 0 rRNA, 0 ncRNA, and 0 others.
Total number of features in the output: 4521.
...
==> Directory /pub37/alexp/data/rnaseq/for_bacpipe_paper/jay_tut/ref_strains/LT2_ref was not found and will be created.
...
Output GFF: 4554 CDS, 39 pseudogenes (as CDS), 85 tRNA, 22 rRNA, 11 ncRNA, and 6 others.
Total number of features in the output: 4717.
...
==> Directory /pub37/alexp/data/rnaseq/for_bacpipe_paper/jay_tut/ref_strains/SL1344_ref was not found and will be created.
...
Output GFF: 4672 CDS, 72 pseudogenes (as CDS), 0 tRNA, 0 rRNA, 0 ncRNA, and 0 others.
Total number of features in the output: 4744.
...
==> Performing Roary output reformatting; results will be written to roary/presence_absence_unix.csv
==> Generating a table of CDS orthologs.
...
==> DONE generating multi-strain reference!
After this is complete, you should find the file orthologs.tsv in your main working directory. The file contains 5667 lines (+header), of which
- 5367 are CDS
- 290 are ncRNA
- 7 are misc (pseudogenes etc)
- 3 are repeat_region (CRISPR repeats identified by Prokka)
Of these,
- 4936 are located on (non-prophage part of) the chromosome
- 398 are located in a prophage region;
- 333 are located on a plasmid.
We're all ready to go now! Go to Stm_tutorial and run bacpipe like this (adjust the number of cores used according to what's available on your system):
bacpipe . mult.cfg -p 64 -k LT2_ref | tee bacpipe.log
This will write the bacpipe.log file while printing logs to the screen at the same time. If you don't want to monitor the progress real-time, you can simply replace the command with
bacpipe . mult.cfg -p 64 -k LT2_ref &> bacpipe.log
After this, you should obtain several folders with the processed RNA-seq data.
- FastQC contains the output of fastqc intial read quality control;
- bams contains rRNA/tRNA-filtered indexed BAM files used for visualization, as well as STAR alignment logs;
- tdfs_and_bws contain TDF files (to be visualized in IGV), and straind-specific, normalized bigWig files (can be visualized in almost any genome browser);
- strand and featureCounts contain the results of
featureCounts
quantification. The former contains tables for all possible strand-specificities, and the latter contains only the quantificaion for the correct strand; - exp_tables contains annotated and non-annotated expression tables in the form of raw counts and TPMs;
- stats contains statistics calculated per individual sample. A summary of these statistics is given in the final log.
Strand specificity is reported in the log:
Average percent of reads matching the coding strand: 95.419, lowest: 95.134, highest: 95.740
The strandedness of your experiment was determined to be FR
This RNA-seq experiment was done using a strand-specific protocol - particularly, the variety in which reads align to the same strand as the gene which gave rise to the read. Read number statistics are reported above:
Sample | Strain | All | rRNA | unmapped | multimap | 1_loc | assigned |
---|---|---|---|---|---|---|---|
474_InSPI2_rep1 | 474 | 14333158 | 10478217 | 687949 | 17933 | 3149059 | 2948423 |
474_InSPI2_rep2 | 474 | 15502120 | 11326243 | 498271 | 19149 | 3658457 | 3405101 |
474_InSPI2_rep3 | 474 | 13405916 | 9592274 | 679457 | 16818 | 3117367 | 2895934 |
D23_InSPI2_rep1 | D23 | 7785113 | 5747933 | 139763 | 11681 | 1885736 | 1773365 |
D23_InSPI2_rep2 | D23 | 7057773 | 5296264 | 184865 | 9703 | 1566941 | 1473220 |
D23_InSPI2_rep3 | D23 | 6534047 | 4859759 | 83644 | 9826 | 1580818 | 1478296 |
This shows us the raw read numbers (read pairs for paired-end experiments). In this case we see that the experiments were not fully rRNA-depleted, with 7-16 M initial reads and 1.5-3.4 M reads assigned to annotated genes. Percentages of the initial reads are given below:
Sample | Strain | rRNA | unmapped | multimap | 1_loc | assigned |
---|---|---|---|---|---|---|
474_InSPI2_rep1 | 474 | 73.105 | 4.800 | 0.125 | 21.970 | 20.571 |
474_InSPI2_rep2 | 474 | 73.063 | 3.214 | 0.124 | 23.600 | 21.965 |
474_InSPI2_rep3 | 474 | 71.553 | 5.068 | 0.125 | 23.254 | 21.602 |
D23_InSPI2_rep1 | D23 | 73.832 | 1.795 | 0.150 | 24.222 | 22.779 |
D23_InSPI2_rep2 | D23 | 75.042 | 2.619 | 0.137 | 22.202 | 20.874 |
D23_InSPI2_rep3 | D23 | 74.376 | 1.280 | 0.150 | 24.194 | 22.625 |
The percentage of reads that don't map to rRNA/tRNA (multimap + 1_loc) is pretty close to "assigned" - meaning that the annotation does not miss any strongly expressed features. A larger difference would indicated that the annotation is missing some strongly expressed features (most commonly, ncRNA).
After you've successfully completed the processing, you can use the obtained expression tables to get PCA, clustering, and differential expression analysis. A quick and interactive processing can be done in Phantasus. Taking annotated table of counts (Master_table.counts.tsv), we can upload the table to the web tool, normalize it (Adjust -> log2-transform -> quantile normalize), and run PCA analysis (Tools -> Plots -> PCA plot). This should give us the following plot, indicating clear separation of mutants from WT samples:
Quick differential expression using limma (Tools -> Differential expression -> limma) generates the following picture, identifying genes up- and down-regulated in misR mutants:
Overall, bacpipe
streamlines the processing of simple, single-strain RNA-seq experiments.