Skip to content

eastmallingresearch/Metabarcoding_pipeline

 
 

Repository files navigation

Index

  1. Description
  2. HMM Preperation for ITS analysis
  3. Common workflow
  4. Bacterial 16S workflow
  5. Fungal ITS workflow
  6. Oomycete ITS workflow
  7. Nematode 18S workflow
  8. Statistical analysis

Description

Metabarcoding pipeline for Illumina MiSeq data. This pipeline is base in the main part on usearch (v10 currently) 32 bit (http://www.drive5.com/usearch/) and is designed to run on a Slurm cluster (orginally Sun grid engine - there may be a bit of legacy code hanging around)

Tested for bacterial 16S, fungal ITS, Oomycete ITS and Nematode 18S amplicons

Setup pipeline

# set MBPL variable to pipeline folder
MBPL=~/metabarcoding_pipeline
# to set permanetly for future (bash) shell sessions (be careful with this, if you have settings in ~/.profile they will no longer load)
echo export MBPL=~/metabarcoding_pipeline >>~/.bash_profile

HMM Preperation for ITS analysis

#f03c15 NOTE: I no longer bother with removal of the none ITS regions from the raw sequence - it is still useful to remove the none ITS regions from the final OTUs before assigning taxonomy.

The ITS pipelines are more involved and include scripts for removing common regions of 18S (SSU, 5.8S and LSU). The current implementation uses hidden markov models provided with ITSx (http://microbiology.se/software/itsx/) of these regions and HHMMER v 3.1b2 (http://hmmer.janelia.org/) to find them within the seqeunces. Scripts are provided to then remove the regions.

The HMM files need to be pepared before using the pipeline

The script cut_hmm.pl splits the combined hmm into individual files, which allows the four parts (ssu end, 58S start, 58 end and lsu start) to be in their own files. This allows a roughly three to four fold speed increase for the whole pipeline.

NOTE: This is dependent on the hmm files downloaded from ITSx - if they change the format it may no longer work.

perl $MBPL/scripts/cut_hmm.pl $MBPL/hmm/F.hmm $MBPL/hmm/chopped_hmm Fungi

cd $MBPL/hmm/chopped_hmm

cat *SSU*> t1
cat *58S_start* > t2
cat *58S_end* > t3
cat *LSU* > t4

hmmconvert t1 > $MBPL/hmm/Fun/ssu_end.hmm
hmmconvert t2 > $MBPL/hmm/Fun/58s_end.hmm
hmmconvert t3 > $MBPL/hmm/Fun/58s_start.hmm
hmmconvert t4 > $MBPL/hmm/Fun/lsu_start.hmm

cd $MBPL/hmm/Fun
rm -r $MBPL/hmm/chopped_hmm

for f in *.hmm
do
	sed -i -e'/^LENG/a MAXL  90' $f
done

hmmpress ssu_end.hmm
hmmpress 58s_end.hmm
hmmpress 58s_start.hmm
hmmpress lsu_start.hmm

NOTES

Files copied to $MBPL/hmm
Repeat for O.hmm for oomycetes (or for any of the other HMMs you want to include) and set Fungi to Oomycota in the call to cut_hmm.pl . The output files from hmmpress will need to be copied to another location e.g. $MBL/hmm/OO

Taxonomy reference databases

Assigning taxonomy to OTUs requires a reference database(s) and these will need to be configured for use with the pipeline.

The Unite V7 (fungi) and RDP trainset 15 (bacteria) reference database were downloaded from
http://drive5.com/usearch/manual/utax_downloads.html
Configuration has been tested with usearch8.1 (probably works the same for 10)

usearch8.1 -makeudb_utax refdb.fa -output 16s_ref.udb -report 16s_report.txt
usearch8.1 -makeudb_utax refdb.fa -utax_trainlevels kpcofgs ‑utax_splitlevels NVpcofgs -output ITS_ref.udb -report ITS_report.txt

Sintax

ITS (fungi) Unite database v 20 download
https://files.plutof.ut.ee/public/orig/E8/83/E883EB19E3EA7B64C1F652521301239831FAFE0BFF015C9E2B4786DC0976C0FC.gz

To convert to a Sintax db (usearch 11.x) run:

gunzip E883EB19E3EA7B64C1F652521301239831FAFE0BFF015C9E2B4786DC0976C0FC.gz
mv E883EB19E3EA7B64C1F652521301239831FAFE0BFF015C9E2B4786DC0976C0FC.gz unite.fa
sed -ie 's/[dpcofg]:,//g' unite.fa
usearch -tax_stats taxdb.fa -log stats.txt
usearch -makeudb_usearch unite.fa -output unite_ITS.udb

Obviously names will change with updates of the unite database

Planta

It can be useful to add plant species into the databases if searching for endophytes (to exclude the plant species form results Subsets of plant ssu data can be downloaded from the silva database. A few adjustments need to be made to the headers.

cat *.fasta|
sed 's/c:unknown,//'|
sed 's/;$//'|
awk '{
	if (match($0, /s:.*/)) { 
		x=substr($0, RSTART, RLENGTH);++c[x];
		if (c[x]>1) {print $0," PL_"c[x]} 
		else {print $0}
	} 
	else {print $0}
}' > plantae.fa

Oomycota

The oomycota database combines three sets of data; 1) a subset (stamenopiles) of the silva_ssu database https://www.arb-silva.de/browser/, 2) a supplied 3rd party database 3) and a usearch specific Unite fungi database (donwloadable from the Unite website)

The silva_ssu and 3rd party databases required slight modification before use with usearch

# convert silva_ssu multiline to single line fasta
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < arb-silva.de_2022-07-29_id1192699_tax_silva.fasta | tail -n +2 >silva.fa

# rename taxonomy to be compatible withe other databases
sed -e 's/ Eukaryota;SAR;Stramenopiles;Peronosporomycetes;/;tax=d:SAR;p:Oomycota;s:/' silva.fa > oomycota.silva.fa
sed -ie 's/s:Achlya;/o:Saprolegniales;f:Saprolegniaceae;g:Achlya;s:/' oomycota.silva.fa
sed -ie 's/s:Achlya sparrowii/o:Saprolegniales;f:Saprolegniaceae;g:Achlya;s:Achlya sparrowii/' oomycota.silva.fa
sed -ie  's/s:Aphanomyces;/o:Saprolegniales;f:Saprolegniaceae;g:Aphanomyces;s:/'  oomycota.silva.fa
sed -ie 's/s:Aplanopsis;/o:Saprolegniales;f:Saprolegniaceae;g:Aplanopsis;s:/' oomycota.silva.fa
sed -ie 's/s:Apodachlya;/o:Leptomitales;c:Leptomitales incertae sedis;g:Apodachlya;s:/'  oomycota.silva.fa
sed -ie 's/s:Atkinsiella;/o:Lagenidiales;f:Haliphthoraceae;g:Atkinsiella;s:/' oomycota.silva.fa
sed -ie 's/s:Eurychasma;/o:Myzocytiopsidales;f:Eurychasmataceae;g:Eurychasma;s:/' oomycota.silva.fa
sed -ie 's/s:Haliphthoros;/o:Lagenidiales;f:Haliphthoraceae;g:Haliphthoros;s:/'  oomycota.silva.fa
sed -ie 's/s:Halodaphnea;/o:Lagenidiales;f:Haliphthoraceae;g:Halodaphnea;s:/' oomycota.silva.fa
sed -ie 's/s:Halophytophthora;/o:Pythiales;f:Pythiaceae;g:Halophytophthora;s:/'  oomycota.silva.fa
sed -ie 's/s:Haptoglossa;/o:Oomycota incertae sedis;g:Haptoglossa;s:/'  oomycota.silva.fa
sed -ie 's/s:Lagenidium;/o:Lagenidiales;f:Lagenidiaceae;g:Lagenidium;s:/'  oomycota.silva.fa
sed -ie 's/s:Leptolegnia;/o:Saprolegniales;f:Saprolegniaceae;g:Leptolegnia;s:/'  oomycota.silva.fa
sed -ie 's/s:Myzocytiopsis;/o:Myzocytiopsidales;f:Myzocytiopsidaceae;g:Myzocytiopsis;s:/' oomycota.silva.fa
sed -ie 's/s:Olpidiopsis;/o:Olpidiopsidales;f:Olpidiopsidaceae;g:Olpidiopsis;s:/' oomycota.silva.fa
sed -ie 's/s:Phytophthora;/o:Peronosporales;f:Peronosporaceae;g:Phytophthora;s:/' oomycota.silva.fa
sed -ie 's/s:Phytopythium;/o:Pythiales;f:Pythiaceae;g:Phytopythium;s:/' oomycota.silva.fa
sed -ie 's/s:Pythium;/o:Pythiales;f:Pythiaceae;g:Pythium;s:/' oomycota.silva.fa
sed -ie 's/s:Salispina;/o:Pythiales;f:Pythiaceae;g:Salispina;s:/' oomycota.silva.fa
sed -ie 's/s:Saprolegnia;/o:Saprolegniales;f:Saprolegniaceae;g:Saprolegnia;s:/' oomycota.silva.fa
sed -ie 's/s:Sapromyces;/o:Rhipidiales;f:Rhipidiaceae;g:Saprolegnia;s:/' oomycota.silva.fa
sed -ie 's/s:uncultured;/g:uncultured;s:/' oomycota.silva.fa

sed -ie 's/;/,/2g'  oomycota.silva.fa

usearch -tax_stats oomycota.silva.fa -log stats.txt

# combine and replace fasta headers with headers including full taxonomy
awk -F";" 'NR==FNR{a[$1]=$0;next;}a[$1]{$0=a[$1]}1' Oomycota.txt Oomycota.fasta > oomycetes_taxonomy.fa
sed -ie 's/k:SAR,p:Heterokontophyta,c:/d:SAR,p:/' oomycetes_taxonomy.fa

# append files and remove duplicates (append +1 to duplicated species) and naming issues

cat oomycetes_taxonomy.fa  oomycota.silva.fa|
sed 's/f:Pythiaceae,g:Phytophthora/f:Peronosporaceae,g:Phytophthora/'|
sed 's/o:Rhipidiales,f:Rhipidiaceae/o:Saprolegniales,f:Saprolegniaceae/'|
awk '{
	if (match($0, /s:.*/)) { 
		x=substr($0, RSTART, RLENGTH);++c[x];
		if (c[x]>1) {print $0," OO_"c[x]} 
		else {print $0}
	} 
	else {print $0}
}' > oomycota.fa

cat oomycota.fa plantae.fa unite_8.3.fasta > oo_fungi_plant.fa

usearch -makeudb_usearch oo_fungi_plant.fa -output OOM_ref.udb

Nematodes

Nematode database is also a subset of Silva_ssu
Again note that as the required section of the database can be downloaded directly the below (prior to the usearch command) may not be required

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < SILVA_123_SSURef_Nr99_tax_silva.fasta | tail -n +2 >silva.fa

grep Nematoda -A1 --no-group-separator silva.fa | sed -e 's/ Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;/;tax=k:Metazoa;p:/'| \
awk -F";" '{
	if(NF>1){
		if(NF==5) {print $1";"$2";" $3";c:"$4";s:"$5}
		if(NF==6) {print $1";"$2";" $3";c:"$4";o:"$5";s:"$6}
		if(NF==7) {print $1";"$2";" $3";c:"$4";o:"$6";s:"$7}
		
	} else {print $1}
}'  > nem_tax.fasta

grep Eumetazoa -A1 --no-group-separator silva.fa > Eumetazoa.fa

grep -n -A1 Nematoda Eumetazoa.fa | \
sed -n 's/^\([0-9]\{1,\}\).*/\1d/p' | \
sed -f - Eumetazoa.fa|awk -F";" '{if(NF>1){print $1";tax=k:Metazoa;p:"$6}else {print $1}}' > nonem_tax.fasta

cat nem_tax.fasta nonem_tax.fasta > Eumetazoa_tax.fasta

usearch -makeudb_sintax nem_tax.fasta -output NEM_ref.udp
usearch -makeudb_sintax Eumetazoa_tax.fasta -output xNEM_ref.udp

Releases

No releases published

Packages

No packages published

Languages

  • R 73.2%
  • Perl 13.6%
  • Shell 13.2%