Subcommand: edpl

Calculate the Expected Distance between Placement Locations (EDPL) for all pqueries.

Usage: gappa examine edpl [options]

Options

Input
`--jplace-path`	Required. `TEXT:PATH(existing)=[] ...` List of jplace files or directories to process. For directories, only files with the extension `.jplace[.gz]` are processed.
Settings
`--histogram-bins`	`UINT=25` Number of histogram bins for binning the EDPL values.
`--histogram-max`	`FLOAT=-1` Maximum value to use in the histogram for binning the EDPL values. To use the maximal EDPL found in the samples, use a negative value (default).
`--no-list-file`	`FLAG` If set, do not write out the EDPL per pquery, but just the histogram file. As the list needs to keep all pquery names in memory (to get the correct order), the memory requirements might be too large. In that case, this option can help.
Output
`--out-dir`	`TEXT=.` Directory to write output files to.
`--file-prefix`	`TEXT` File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
`--file-suffix`	`TEXT` File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
Global Options
`--allow-file-overwriting`	`FLAG` Allow to overwrite existing output files instead of aborting the command.
`--verbose`	`FLAG` Produce more verbose output.
`--threads`	`UINT` Number of threads to use for calculations.
`--log-file`	`TEXT` Write all output to a log file, in addition to standard output to the terminal.
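
As a minimal sketch of how the options above combine into one invocation, the following Python helper assembles an argument list using only the flags documented here. The helper name `build_edpl_command` and the paths `samples/` and `edpl_out` are hypothetical, and the sketch assumes that the `gappa` binary is available on the `PATH` if the command is actually run.

```python
import subprocess


def build_edpl_command(jplace_paths, out_dir=".", histogram_bins=25,
                       histogram_max=None, no_list_file=False):
    """Assemble a `gappa examine edpl` argument list (hypothetical helper)."""
    cmd = ["gappa", "examine", "edpl", "--jplace-path", *jplace_paths]
    cmd += ["--histogram-bins", str(histogram_bins)]
    if histogram_max is not None:
        # Omitting the option keeps the documented default (-1),
        # i.e. the maximal EDPL found in the samples is used.
        cmd += ["--histogram-max", str(histogram_max)]
    cmd += ["--out-dir", out_dir]
    if no_list_file:
        # Only write the histogram file, not the per-pquery list.
        cmd.append("--no-list-file")
    return cmd


# Hypothetical input directory and output directory.
command = build_edpl_command(["samples/"], out_dir="edpl_out", no_list_file=True)
print(" ".join(command))

# To actually run it (assumes gappa is installed):
# subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting issues when paths contain spaces.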

Description

Calculates the expected distance between placement locations (EDPL) for all pqueries in the given samples. The command is a re-implementation of guppy edpl; see its documentation for more details.

Details

The EDPL measures placement uncertainty: how far the placements of a pquery (query sequence) are spread across the branches of the reference tree. In a reference tree with similar sequences, a query sequence might be placed on several nearby branches, each with a relatively high likelihood weight ratio (LWR). This still constitutes high confidence in the placement, as the spread is due to the similarity of the reference sequences, not to inherent uncertainty in the placement itself. In contrast, a query sequence whose placements are spread all across the tree might indicate that a fitting reference sequence is missing from the tree, which yields uncertain placements.

This can be assessed with the EDPL, which calculates the distances between different placements, weighted by their respective LWRs:

Figure: Example of the EDPL for a pquery with three placement locations.

The p values in the figure represent likelihood weight ratios of the placements at these locations. The distances d are calculated using the branch lengths of the tree on the path between the placement locations. Hence, a low EDPL indicates that the placements of a pquery (query sequence) are focused in a narrow region of the tree, whereas a high EDPL indicates that the placements are spread across the tree.
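
The weighting described above can be illustrated with a small Python sketch. It assumes that the EDPL is the sum, over all pairs of placement locations, of the product of the two LWRs and the path distance between them. Counting each unordered pair once is an assumption of this sketch (counting ordered pairs instead would double the value); the function name `edpl` and all numbers are made up for illustration and are not taken from gappa or guppy.

```python
from itertools import combinations


def edpl(lwrs, distances):
    """Sum LWR-weighted distances over all unordered pairs of placements.

    lwrs: likelihood weight ratios, one per placement location.
    distances: dict mapping index pairs (i, j) with i < j to the path
               length between the two placement locations.
    """
    total = 0.0
    for i, j in combinations(range(len(lwrs)), 2):
        total += lwrs[i] * lwrs[j] * distances[(i, j)]
    return total


# Hypothetical pquery with three placement locations.
lwrs = [0.6, 0.3, 0.1]
distances = {(0, 1): 0.2, (0, 2): 1.0, (1, 2): 0.9}

# Approximately 0.123 = 0.6*0.3*0.2 + 0.6*0.1*1.0 + 0.3*0.1*0.9
print(edpl(lwrs, distances))
```

A pquery with a single placement location has no pairs, so its EDPL under this sketch is 0, which matches the intuition that there is no spread.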

See http://matsen.github.io/pplacer/generated_rst/guppy_edpl.html for more information.

The command produces two tables:

`list.csv`: A list of the EDPL for each pquery of each sample. The list contains four columns: sample name (taken from the input file name), pquery name (one line per name for pqueries with multiple names), the weight (multiplicity) of the pquery, and the EDPL value of that pquery. As this list needs a considerable amount of memory (about as much as the input jplace files), it can be deactivated with `--no-list-file`.
`histogram.csv`: A summary histogram of the EDPL values. It can be loaded into spreadsheet tools to plot an overview of the values for easy assessment. The histogram output can be refined with the settings `--histogram-bins` and `--histogram-max`.

The histogram can for example be visualized as follows:

Figure: Example of an EDPL histogram.

The histogram shows the accumulated EDPL values: the x-axis shows the EDPL, and the y-axis shows the fraction of query sequences that have an EDPL at or below the respective value. For example, the lowest bin indicates that more than 60% of the query sequences have an EDPL between 0.0 and 0.02.
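
To make the accumulated histogram concrete, here is a minimal Python sketch that bins a list of EDPL values in the way the text describes. The function name `accumulated_histogram` is hypothetical. The sketch assumes equal-width bins starting at 0, treats a negative maximum as "use the largest value found" (mirroring the documented default of `--histogram-max`), and ignores pquery weights (multiplicities). How gappa itself handles weights or values above the maximum is not covered here.

```python
def accumulated_histogram(values, bins=25, max_value=-1.0):
    """Return (upper_edges, fractions) for an accumulated histogram.

    fractions[k] is the fraction of values at or below upper_edges[k].
    """
    if max_value < 0:
        # Negative maximum: fall back to the largest value in the data.
        max_value = max(values)
    edges = [max_value * (k + 1) / bins for k in range(bins)]
    edges[-1] = max_value  # avoid rounding error on the last bin edge
    n = len(values)
    fractions = [sum(1 for v in values if v <= edge) / n for edge in edges]
    return edges, fractions


# Made-up EDPL values for five query sequences.
values = [0.0, 0.01, 0.02, 0.5, 1.0]
edges, fractions = accumulated_histogram(values, bins=4)
for edge, fraction in zip(edges, fractions):
    print(f"EDPL <= {edge:.2f}: {fraction:.0%}")
```

In this made-up example, three of the five values fall into the lowest bin, so the first bar already reaches 60%, similar in shape to the example discussed above.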

Citation

When using this method, please do not forget to cite

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070

Frederick Matsen, Steven Evans. Edge Principal Components and Squash Clustering: Using the Special Structure of Phylogenetic Placement Data for Sample Comparison. PLOS ONE, 2013. doi:10.1371/journal.pone.0056859
