Running recentrifuge for LMAT

Jose Manuel Martí edited this page Jan 3, 2019 · 8 revisions

Quick start

Suppose you have installed Recentrifuge with pip or cloned the repo into ~/recentrifuge, and you want to analyze and compare the LMAT output from several samples. Provided you have used retaxdump to populate ./taxdump and each sample sits in its own subdirectory of the current directory, you can put Recentrifuge to work with a single line:

~/recentrifuge/rcf -l .

For pip installations, omit the ~/recentrifuge/ prefix in this and all later command lines.

If you would like to include LMAT plasmid classifications in your results, please add the LMAT file plasmid.names.txt to the taxdump directory as detailed here.

Details

Automatic detection of LMAT outputs

The solution above is Recentrifuge's LMAT automatic mode: it searches the current directory for subdirectories whose names do not start with a dot and assumes that each of them holds a set of LMAT output files (typically Long.file.name.ending.in_output??.out, where ?? runs from 0 to the number of cores used by LMAT minus one). Recentrifuge then loads the samples in parallel, parsing the output files within each subdirectory sequentially. This is the most convenient mode of operation for large sets of samples analyzed by LMAT, since it avoids specifying them one by one.
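As an illustration of the discovery step described above, here is a minimal Python sketch. It is not Recentrifuge's own code: the function name find_lmat_samples and the regular expression are assumptions made for this example, based only on the _output??.out naming pattern mentioned in the text.

```python
import re
from pathlib import Path

# Assumed naming pattern: <anything>_output<N>.out, as described in the text.
LMAT_OUT = re.compile(r"_output(\d+)\.out$")


def find_lmat_samples(root="."):
    """Map each non-hidden subdirectory of root to its LMAT output files.

    Hypothetical helper for illustration only.
    """
    samples = {}
    for subdir in sorted(Path(root).iterdir()):
        # Skip plain files and directories whose names start with a dot.
        if not subdir.is_dir() or subdir.name.startswith("."):
            continue
        matches = []
        for entry in subdir.iterdir():
            found = LMAT_OUT.search(entry.name)
            if entry.is_file() and found:
                matches.append((int(found.group(1)), entry))
        if matches:
            # Sort by the numeric suffix so output10 follows output9.
            matches.sort(key=lambda pair: pair[0])
            samples[subdir.name] = [entry for _, entry in matches]
    return samples
```

Calling find_lmat_samples(".") returns a dictionary keyed by subdirectory name. Subdirectories without matching files are simply left out of the result.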

If you want to be more specific about where the samples are located, you can give the directory that contains the output files for each sample. So, if the samples S1, S2 and S3, for instance, are in the directories ./setA/S1, ./setA/S2 and ./setB/S3, respectively, the command would be:

~/recentrifuge/rcf -l ./setA/S1 -l ./setA/S2 -l ./setB/S3

Finally, if you have several samples analyzed by LMAT in the same directory, you can select them one by one by giving the file prefix of each sample. For example, if the samples S10, S11 and S12 are together in the directory ./all_samples, the command would be:

~/recentrifuge/rcf -l ./all_samples/S10 -l ./all_samples/S11 -l ./all_samples/S12  

You can mix the last two approaches: for each -l argument, Recentrifuge detects whether you are selecting a whole directory or just a file prefix to be matched inside a directory.
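The directory-versus-prefix distinction can be sketched in Python as follows. This is an assumption about how such a check might look, not a description of Recentrifuge's internals; the name resolve_lmat_target and the .out suffix filter are hypothetical.

```python
from pathlib import Path


def resolve_lmat_target(target):
    """Classify one -l argument as a directory or a file prefix.

    Hypothetical helper: returns ('directory', files) or ('prefix', files).
    """
    path = Path(target)
    if path.is_dir():
        # Whole directory: take every .out file inside it.
        files = [p for p in path.iterdir() if p.name.endswith(".out")]
        return "directory", sorted(files)
    # Otherwise treat the last path component as a file-name prefix
    # to be matched inside the parent directory.
    parent = path.parent
    files = []
    if parent.is_dir():
        files = [
            p for p in parent.iterdir()
            if p.name.startswith(path.name) and p.name.endswith(".out")
        ]
    return "prefix", sorted(files)
```

With this sketch, resolve_lmat_target("./setA/S1") would be treated as a directory when ./setA/S1 exists as one, while resolve_lmat_target("./all_samples/S10") would fall back to matching files in ./all_samples whose names start with S10.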

File format

Recentrifuge reads the direct outputs from LMAT and shows several descriptive statistics about both the reads and the LMAT classification process. Note that the output from LMAT is quite verbose and, for large datasets, Recentrifuge can take several minutes to process it. For example, for the LMAT analysis of the SRR829867 dataset using 32 cores and filtering out reads classified with an LMAT score under -1, the Recentrifuge console output is:

Loading output file ~/SRR829867/SRR829867.fasta.lmat-4-14.20mer.db.lo.rl_output0.out...OK!
Loading output file ~/SRR829867/SRR829867.fasta.lmat-4-14.20mer.db.lo.rl_output1.out...OK!
(...)
Loading output file ~/SRR829867/SRR829867.fasta.lmat-4-14.20mer.db.lo.rl_output31.out...OK!
  Seqs read: 23_025_264	[2.33 Gnt]
  Seqs clas: 22_870_103	(0.67% unclassified)
  Seqs pass: 7_245_243	(68.32% rejected)
  DB Matching: Multi = 77.0%  Direct = 22.3%  ReadTooShort = 0.0%  LowScore = 0.0%  NoDbHits = 0.7%
  Scores: min = -1.0, max = 2.8, avr = -0.5
  Length: min = 101 nt, max = 101 nt, avr = 101 nt
  4345 taxa with assigned reads
/Users/martijm/local/lmat-scripts/SRR829867 sample OK!
Load elapsed time: 347 sec
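The percentages in this summary can be checked by hand. The short Python sketch below reproduces them from the three read counts. It assumes, as the figures suggest, that the unclassified percentage is relative to the reads read and the rejected percentage is relative to the reads classified. The variable names are chosen for this example.

```python
# Counts copied from the console output above; int() accepts the
# underscore digit separators in Python 3.6 and later.
seqs_read = int("23_025_264")
seqs_clas = int("22_870_103")
seqs_pass = int("7_245_243")

# Assumed definitions, consistent with the printed figures.
unclassified = 100 * (seqs_read - seqs_clas) / seqs_read
rejected = 100 * (seqs_clas - seqs_pass) / seqs_clas

print(f"{unclassified:.2f}% unclassified")  # 0.67% unclassified
print(f"{rejected:.2f}% rejected")          # 68.32% rejected
```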

More about plasmids

If you would like to include LMAT plasmid classifications in your results, please add the LMAT file plasmid.names.txt to the taxdump directory as detailed here. Recentrifuge will detect it and parse the plasmid data using regular expressions. Recentrifuge then prints a summary of this parsing and of a sanity check of the plasmid data found:

Loading LMAT plasmids... OK! 
 Plasmid sanity check:  rejected (taxid error) = 286  rejected (parent error) = 2
 Plasmid pattern matching:  1st type = 4066   2nd type = 284   other = 0

If the debug flag is active (-g), Recentrifuge will give details:

Plasmid taxid ERROR! Taxid=294 already a NCBI taxid. Declared parent is 294 but NCBI parent is 136843.
	Plasmid details:  Pseudomonas fluorescens strain PC20 plasmid pNAH20, complete sequence[sequence_id 814379][seq_data_id 7244447][tax_node_id 294]
Plasmid taxid ERROR! Taxid=1318 already a NCBI taxid. Declared parent is 1318 but NCBI parent is 1301.
	Plasmid details:  Streptococcus parasanguinis plasmid pFW213, complete sequence[sequence_id 806688][seq_data_id 7037937][tax_node_id 1318]
(...)
Plasmid parent taxid ERROR! Taxid=2521 and parent=2521.
	Plasmid details:  Plasmid pADB201 (from Mycoplasma mycoides), complete genome[sequence_id 829559][seq_data_id 8131462][tax_node_id 2521]
(...)
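To give an idea of what regular-expression parsing of such entries can look like, here is a minimal Python sketch that extracts the name and the bracketed identifiers from a "Plasmid details" string like the ones shown above. The pattern and the function name parse_plasmid_details are assumptions for this example; they are not the expressions Recentrifuge uses.

```python
import re

# Matches bracketed fields such as "[tax_node_id 294]".
FIELDS = re.compile(r"\[(sequence_id|seq_data_id|tax_node_id) (\d+)\]")


def parse_plasmid_details(line):
    """Split a plasmid description into its name and integer id fields.

    Hypothetical helper for illustration only.
    """
    # Everything before the first bracket is taken as the plasmid name.
    name = line.split("[", 1)[0].strip()
    ids = {key: int(value) for key, value in FIELDS.findall(line)}
    return name, ids


name, ids = parse_plasmid_details(
    "Plasmid pADB201 (from Mycoplasma mycoides), complete genome"
    "[sequence_id 829559][seq_data_id 8131462][tax_node_id 2521]"
)
print(name)
print(ids["tax_node_id"])
```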

Scoring scheme

Recentrifuge supports the specific scoring scheme of LMAT, which is the default and only choice for LMAT outputs. The minscore parameter applies to the final LMAT score of each read. LMAT classification scores between 0 and -1 are considered not very reliable, and classifications with scores under -1 should be used with caution.
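The effect of a minscore threshold can be pictured with a small Python sketch. The read identifiers and scores are invented for this example, and the comparison assumes that a read whose score equals the threshold is kept, which is consistent with the minimum score of -1.0 reported in the console output of the File format section.

```python
# Hypothetical (read id, final LMAT score) pairs for illustration.
reads = [("r1", 2.8), ("r2", -0.5), ("r3", -1.0), ("r4", -1.7)]
minscore = -1.0

# Keep reads whose score is not under the threshold (assumed inclusive).
kept = [read_id for read_id, score in reads if score >= minscore]
print(kept)  # ['r1', 'r2', 'r3']
```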

Advanced example

As a more complete example, to analyse the LMAT output:

  • with the taxonomy files downloaded to /my/tax/dir,
  • from samples X1 (in directory ./A/X1), X2 (in directory ./B/X2), X3 (in directory ./C/X3) and X4 (in directory ./D/X4),
  • with ONE negative control (in directory ./CTRL),
  • excluding reads with an LMAT classification score under -1.0,
  • excluding taxa (un)assigned to cellular organisms (taxid 131567) and assigned to chordata (taxid 7711),
  • with the output Excel in cmplxCruncher format,
  • and saving the output to the Xsamples.rcf.html file,

the command would be:

~/recentrifuge/rcf -n /my/tax/dir -l ./CTRL -l ./A/X1 -l ./B/X2 -l ./C/X3 -l ./D/X4 -c 1 -y "-1.0" -x 131567 -x 7711 -e CMPLXCRUNCHER -o Xsamples.rcf.html
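When the list of samples is long, a command like this can be assembled programmatically. The following Python sketch builds the same argument list for a pip installation, where the command is plain rcf. It uses only the flags shown in the command above. The helper name build_rcf_command is hypothetical, and the sketch assumes that -c counts the leading -l samples as controls, which matches the ordering used above.

```python
import shlex


def build_rcf_command(taxdir, controls, samples, minscore, excluded, outfile):
    """Assemble an rcf argument list for LMAT samples (illustrative only)."""
    argv = ["rcf", "-n", taxdir]
    # Controls go first so that -c can refer to the leading samples.
    for path in list(controls) + list(samples):
        argv += ["-l", path]
    argv += ["-c", str(len(controls)), "-y", str(minscore)]
    for taxid in excluded:
        argv += ["-x", str(taxid)]
    argv += ["-e", "CMPLXCRUNCHER", "-o", outfile]
    return argv


cmd = build_rcf_command(
    "/my/tax/dir",
    ["./CTRL"],
    ["./A/X1", "./B/X2", "./C/X3", "./D/X4"],
    -1.0,
    [131567, 7711],
    "Xsamples.rcf.html",
)
print(shlex.join(cmd))  # requires Python 3.8 or later
```

The resulting list can be passed to subprocess.run(cmd, check=True) to launch the analysis without going through a shell.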

The complete guide to rcf options and flags is in the Recentrifuge command line page.