
Subcommand: edpl

Lucas Czech edited this page Jan 4, 2022 · 13 revisions

Calculate the Expected Distance between Placement Locations (EDPL) for all pqueries.

Usage: gappa examine edpl [options]

Options

Input
--jplace-path Required. TEXT:PATH(existing)=[] ...
List of jplace files or directories to process. For directories, only files with the extension .jplace[.gz] are processed.
Settings
--histogram-bins UINT=25
Number of histogram bins for binning the EDPL values.
--histogram-max FLOAT=-1
Maximum value to use in the histogram for binning the EDPL values. To use the maximal EDPL found in the samples, use a negative value (default).
--no-list-file FLAG
If set, do not write out the EDPL per pquery, but just the histogram file. As the list needs to keep all pquery names in memory (to get the correct order), the memory requirements might be too large. In that case, this option can help.
Output
--out-dir TEXT=.
Directory to write output files to.
--file-prefix TEXT
File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
--file-suffix TEXT
File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
Global Options
--allow-file-overwriting FLAG
Allow overwriting existing output files instead of aborting the command.
--verbose FLAG
Produce more verbose output.
--threads UINT
Number of threads to use for calculations.
--log-file TEXT
Write all output to a log file, in addition to standard output to the terminal.

Description

Calculates the expected distance between placement locations (EDPL) for all pqueries in the given samples. The command is a re-implementation of guppy edpl; see the guppy documentation for more details.

Details

The EDPL measures how far the placements of a pquery (query sequence) are spread across the branches of the reference tree, and thereby how uncertain its placement is. In a reference tree with similar sequences, a query sequence might be placed on several nearby branches, each with a relatively high likelihood weight ratio (LWR). This still constitutes high confidence in the placement, as the spread is due to the similar reference sequences and not due to inherent uncertainty in the placement itself. In contrast, a query sequence whose placements are spread all across the tree might indicate that a fitting reference sequence is missing from the tree, which yields uncertain placements.

This can be assessed with the EDPL, which calculates the distances between different placements, weighted by their respective LWRs:

Example of the EDPL for a pquery with three placement locations.

The p values in the figure represent likelihood weight ratios of the placements at these locations. The distances d are calculated using the branch lengths of the tree on the path between the placement locations. Hence, a low EDPL indicates that the placements of a pquery (query sequence) are focused in a narrow region of the tree, whereas a high EDPL indicates that the placements are spread across the tree.
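The weighting described above can be illustrated with a minimal Python sketch. It is not gappa's or guppy's code. It assumes that the EDPL is the expected distance between two locations drawn independently according to their LWRs, that is, the sum of `p_i * p_j * d_ij` over all ordered pairs; whether an implementation counts each pair once or twice is a convention not stated on this page. The function name `edpl`, the LWR values, and the distance matrix are hypothetical, and the distances are taken as already computed from the branch lengths.

```python
def edpl(lwrs, dist):
    """Sketch of an EDPL: expected distance between two placement locations.

    lwrs: likelihood weight ratios p_i of the placements of one pquery.
    dist: symmetric matrix, dist[i][j] is the path distance d_ij between
          placement locations i and j (assumed to be precomputed).
    Assumption: the sum runs over all ordered pairs (i, j).
    """
    total = 0.0
    for i, p_i in enumerate(lwrs):
        for j, p_j in enumerate(lwrs):
            total += p_i * p_j * dist[i][j]
    return total


# Hypothetical pquery with three placement locations.
lwrs = [0.6, 0.3, 0.1]

# Placements on nearby branches: small pairwise distances.
near = [
    [0.0, 0.02, 0.03],
    [0.02, 0.0, 0.01],
    [0.03, 0.01, 0.0],
]

# Same LWRs, but the locations are far apart in the tree.
far = [
    [0.0, 0.2, 1.0],
    [0.2, 0.0, 0.9],
    [1.0, 0.9, 0.0],
]

edpl_near = edpl(lwrs, near)
edpl_far = edpl(lwrs, far)
print(f"near: {edpl_near:.4f}, far: {edpl_far:.4f}")
```

With identical LWRs, only the distances differ between the two cases, so the second pquery gets the larger value. This mirrors the interpretation above: spread across nearby branches keeps the EDPL low, spread across the tree raises it.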

See http://matsen.github.io/pplacer/generated_rst/guppy_edpl.html for more information.

The command produces two tables:

  • list.csv: A list of the EDPL for each pquery of each sample. The list contains four columns: Sample name (using the input file name), pquery name (one line for each name for pqueries with multiple names), the weight (multiplicity) of the pquery, and the EDPL value of that pquery. As this list needs quite some memory (about as much as the input jplace files), it can also be deactivated with --no-list-file.
  • histogram.csv: A summary histogram of the EDPL values. This can be used in spreadsheet tools to produce a graph that allows an overview of the values for easy assessment. Using the settings --histogram-bins and --histogram-max, the histogram output can be refined.

The histogram can for example be visualized as follows:

Example of an EDPL histogram.

The histogram shows the accumulated EDPL values: the x-axis shows the EDPL, and the y-axis shows the fraction of query sequences that have an EDPL at or below the respective value. For example, the lowest bin indicates that more than 60% of the query sequences have an EDPL between 0.0 and 0.02.
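To make the accumulated reading concrete, here is a small Python sketch of how such a cumulative histogram could be built from a list of EDPL values. It is an illustration only and does not reproduce gappa's implementation or the layout of histogram.csv. The parameters `bins` and `max_value` mirror the meaning of --histogram-bins and --histogram-max as described above (a negative maximum means "use the largest value found"). Clamping larger values into the last bin, the function name, and the input values are assumptions.

```python
def cumulative_histogram(values, bins=25, max_value=-1.0):
    """Return (upper_bin_edge, fraction_at_or_below) for each bin.

    Assumptions of this sketch: bins have equal width starting at 0,
    and values above max_value are counted in the last bin.
    """
    if bins <= 0:
        raise ValueError("bins must be positive")
    if not values:
        return []

    # A negative max_value means: use the maximum found in the data.
    upper = max(values) if max_value < 0 else max_value

    counts = [0] * bins
    for v in values:
        if upper > 0:
            idx = min(int(v / upper * bins), bins - 1)
        else:
            idx = 0  # all values are zero
        counts[idx] += 1

    rows = []
    running = 0
    for b, count in enumerate(counts):
        running += count
        rows.append(((b + 1) * upper / bins, running / len(values)))
    return rows


# Hypothetical EDPL values of four pqueries.
edpl_values = [0.01, 0.01, 0.03, 0.09]
rows = cumulative_histogram(edpl_values, bins=5, max_value=0.1)
for edge, fraction in rows:
    print(f"EDPL <= {edge:.2f}: {fraction:.0%}")
```

In this toy input, half of the values fall into the lowest bin, so the first row already reports 50%, and the last row always reaches 100%.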

Citation

When using this method, please do not forget to cite

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070

Frederick Matsen, Steven Evans. Edge Principal Components and Squash Clustering: Using the Special Structure of Phylogenetic Placement Data for Sample Comparison. PLOS ONE, 2013. doi:10.1371/journal.pone.0056859
