Fixed Mahalanobis distance, tutorials (#12)
- Implemented a variant of the squared Mahalanobis distance for data with missing entries; see p. 285 of https://www.jstor.org/stable/3559861 (fixes #12)
- Renamed `MahalanobisDistance` to `SquaredMahalanobisDistance`
- Minor adjustments to tutorials
- Removed extra dependencies
- Incremented version

Fixes #11 and #12
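For readers unfamiliar with the referenced variant, here is a minimal illustrative sketch of the idea, not the package's actual implementation: the quadratic form is evaluated over the coordinates observed in both rows and then rescaled. The helper name and the p/q rescaling factor are assumptions based on the cited reference.

```julia
# Hypothetical sketch (not CEEDesigns' code): squared Mahalanobis distance
# between two rows that may contain `missing` entries, following the idea of
# rescaling the partial distance discussed on p. 285 of the cited paper.
using LinearAlgebra

function sq_mahalanobis_missing(x̂, x, Σ)
    # coordinates observed in both rows
    obs = [i for i in eachindex(x̂) if !ismissing(x̂[i]) && !ismissing(x[i])]
    isempty(obs) && return 0.0
    Δ = Float64.(x̂[obs] .- x[obs])
    d² = Δ' * inv(Σ[obs, obs]) * Δ           # partial squared Mahalanobis distance
    return (length(x̂) / length(obs)) * d²    # rescale by p/q ≥ 1 (assumed variant)
end
```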
thevolatilebit committed Dec 3, 2023
1 parent b50a3cb commit 85c17c1
Showing 16 changed files with 116 additions and 91 deletions.
Project.toml (10 changes: 1 addition & 9 deletions)
@@ -1,18 +1,14 @@
name = "CEEDesigns"
uuid = "e939450b-799e-4198-a5f5-3f2f7fb1c671"
version = "0.3.5"
version = "0.3.6"

[deps]
Clustering = "aaaa29a8-35af-508c-8bc3-b662a17a0fe5"
Combinatorics = "861a8166-3701-5b0c-9a16-15d98fcdc6aa"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
LibPQ = "194296ae-ab2e-5f79-8cd4-7183a0a5a0d1"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
MCTS = "e12ccd36-dcad-5f33-8774-9175229e7b33"
MLJ = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
POMDPSimulators = "e0d0a172-29c6-5d4e-96d0-f262df5d01fd"
POMDPTools = "7588e00f-9cae-40de-98dc-e0c70c48cdd7"
POMDPs = "a93abf59-7444-517b-a68a-c42f96afdd7d"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
@@ -24,16 +20,12 @@ Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"

[compat]
Clustering = "0.15"
Combinatorics = "1.0"
DataFrames = "1.6"
HTTP = "1.10"
JSON = "0.21"
LibPQ = "1.17"
LinearAlgebra = "1.9"
MCTS = "0.5"
MLJ = "0.20"
POMDPSimulators = "0.3"
POMDPTools = "0.1"
POMDPs = "0.9"
Plots = "1.39"
docs/src/api.md (2 changes: 1 addition & 1 deletion)
@@ -33,7 +33,7 @@ CEEDesigns.GenerativeDesigns.efficient_value
CEEDesigns.GenerativeDesigns.DistanceBased
CEEDesigns.GenerativeDesigns.QuadraticDistance
CEEDesigns.GenerativeDesigns.DiscreteDistance
CEEDesigns.GenerativeDesigns.MahalanobisDistance
CEEDesigns.GenerativeDesigns.SquaredMahalanobisDistance
CEEDesigns.GenerativeDesigns.Exponential
```

docs/src/tutorials/GenerativeDesigns.jl (4 changes: 2 additions & 2 deletions)
@@ -74,14 +74,14 @@ DistanceBased(
),
);

# You can also use the Mahalanobis distance (`MahalanobisDistance(; diagonal)`). As the Mahalanobis distance only works with numeric features, we have to select a few, along with the target variable. For example, we could write:
# You can also use the squared Mahalanobis distance (`SquaredMahalanobisDistance(; diagonal)`). As the squared Mahalanobis distance only works with numeric features, we have to select a few, along with the target variable. For example, we could write:

DistanceBased(
data[!, ["RestingBP", "MaxHR", "Cholesterol", "FastingBS", "HeartDisease"]];
target = "HeartDisease",
uncertainty = Entropy(),
similarity = Exponential(; λ = 5),
distance = MahalanobisDistance(; diagonal = 1),
distance = SquaredMahalanobisDistance(; diagonal = 1),
);

# The package offers an additional flexibility by allowing an experiment to yield readouts over multiple features at the same time. In our scenario, we can consider the features `RestingECG`, `Oldpeak`, `ST_Slope`, and `MaxHR` to be obtained from a single experiment `ECG`.
docs/src/tutorials/GenerativeDesigns.md (4 changes: 2 additions & 2 deletions)
@@ -91,15 +91,15 @@ DistanceBased(
nothing #hide
````

You can also use the Mahalanobis distance (`MahalanobisDistance(; diagonal)`). As the Mahalanobis distance only works with numeric features, we have to select a few, along with the target variable. For example, we could write:
You can also use the squared Mahalanobis distance (`SquaredMahalanobisDistance(; diagonal)`). As the squared Mahalanobis distance only works with numeric features, we have to select a few, along with the target variable. For example, we could write:

````@example GenerativeDesigns
DistanceBased(
data[!, ["RestingBP", "MaxHR", "Cholesterol", "FastingBS", "HeartDisease"]];
target = "HeartDisease",
uncertainty = Entropy(),
similarity = Exponential(; λ = 5),
distance = MahalanobisDistance(; diagonal = 1),
distance = SquaredMahalanobisDistance(; diagonal = 1),
);
nothing #hide
````
docs/src/tutorials/SimpleGenerative.jl (31 changes: 18 additions & 13 deletions)
@@ -80,19 +80,23 @@
# If there is no evidence associated with a new entity, we assign $w_{j}$ according to some prior distribution (uniform by default).
# Otherwise we use some particular distance and similarity metrics.
#
# For each feature $x\in X$, we consider a function $\rho_x$, which measures the distance between two outputs. By default, we consider:
# For each feature $x\in X$, we consider a function $\rho_x$, which measures the distance between two outputs. Please be aware that the "distances" discussed here do not conform to the mathematical definition of "metrics", even though they are functions derived from underlying metrics (i.e., a square of a metric). This is justified when considering how these "distances" are subsequently converted into probabilistic weights.
#
# By default, we consider the following distances:
# - Rescaled Kronecker delta (i.e., $\rho(x, y)=0$ only when $x=y$, and $\rho(x, y)= \lambda$ under any other circumstances, with $\lambda > 0$) for discrete features (i.e., features whose types are modeled as `MultiClass` type in [ScientificTypes.jl](https://github.com/JuliaAI/ScientificTypes.jl));
# - Rescaled squared distance $\rho(x, y) = λ \frac{(x - y)^2}{2\sigma^2}$, where $\sigma^2$ is the variance of the feature column, estimated with respect to the prior for continuous features.
# - Mahalanobis distance $\rho(x,y) = \sqrt{(x-y)^{⊤}\Sigma^{-1}(x-y)}$, where $\Sigma$ is the empirical covariance matrix of the historical data.
# - Rescaled ("standardized", by default) squared distance $\rho(x, y) = λ \frac{(x - y)^2}{\sigma^2}$, where $\sigma^2$ is the variance of the feature column, estimated with respect to the prior for continuous features.
# - Squared Mahalanobis distance $\rho(x,y) = (x-y)^{⊤}\Sigma^{-1}(x-y)$, where $\Sigma$ is the empirical covariance matrix of the historical data.
#
# Therefore, for distance metrics assuming independence of features (Kronecker delta and squared distance), given the new entity's experimental state with readouts over the feature set $F = \bigcup X_{e}$, where $e \in S$, we can calculate
# For distance metrics assuming independence of features (Kronecker delta and squared distance), given the new entity's experimental state with readouts over the feature set $F = \bigcup X_{e}$, where $e \in S$, we can calculate
# the distance from the $j$-th historical entity as $d_j = \sum_{x\in F} \rho_x (\hat x, x_j)$, where $\hat x$ and $x_j$ denote the readout
# for feature $x$ for the entity being tested and the entity recorded in $j$-th column.
# Mahalanobis distance directly takes in "rows", $\rho(\hat{x},x_{j})$.
#
# The squared Mahalanobis distance directly takes in "rows", $\rho(\hat{x},x_{j})$.
#
# Next, we convert distances $d_j$ into probabilistic weights $w_j$. By default, we use a rescaled exponential function, i.e.,
# we put $w_j = \exp(-\lambda d_j)$ for some $\lambda>0$. Notably, $\lambda$'s value determines how belief is distributed across the historical entities.
# we put $w_j = \exp(-\lambda d_j)$ with $\lambda=0.5$. Notably, $\lambda$'s value determines how belief is distributed across the historical entities.
# Larger values of $\lambda$ concentrate the belief tightly around the "closest" historical entities, while smaller values distribute more belief to more distant entities.
#
# Note that by choosing the squared Mahalanobis distance and the exponential function with a factor of $\lambda=0.5$, the resulting weight effectively equals the density of a [multivariate normal distribution](https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Density_function) fitted to the historical data. A similar assertion applies when we use the sum of "standardized" squared distances instead. However, in this latter case, we "enforce" the independence of the features.
#
# The proper choice of distance and similarity metrics depends on insight into the dataset at hand. Weights can then be used to construct
# weighted histograms or density estimators for the posterior distributions of interest, or to directly resample historical rows.
@@ -228,7 +232,7 @@ plot(p1, p2, p3, p4; layout = (2, 2), legend = false)
# - `weights`: this represents a function of `evidence` that generates probability weights $w_j$ to each row in the historical data.
#
# By default, Euclidean distance is used as the distance metric. In the second
# call to `DistanceBased`, we instead use the Mahalanobis distance.
# call to `DistanceBased`, we instead use the squared Mahalanobis distance.
# It is possible to specify different distance metrics for each feature, see our
# [heart disease generative modeling](GenerativeDesigns.md) tutorial for more information.
# In both cases, the squared exponential function is used to convert distances
@@ -246,13 +250,14 @@ plot(p1, p2, p3, p4; layout = (2, 2), legend = false)
target = "y",
uncertainty = Variance(),
similarity = GenerativeDesigns.Exponential(; λ = 5),
distance = MahalanobisDistance(; diagonal = 0),
distance = SquaredMahalanobisDistance(; diagonal = 0),
);

# We can look at the uncertainty in $y$ for a state where a single
# feature is "observed" at its mean value. Note that uncertainty is generally higher
# for the Mahalanobis distance, which makes sense as it takes into account the
# non-independence of the features.
# feature is "observed" at its mean value. Note that uncertainty is generally lower for the squared Mahalanobis distance.
# As we consider only a single non-missing entry, this makes sense as the implemented variant of the squared Mahalanobis distance for missing values effectively multiplies
# the other, quadratic distance, by a factor greater than one. For more details, refer to page 285 of
# [Multivariate outlier detection in incomplete survey data: The epidemic algorithm and transformed rank correlations](https://www.jstor.org/stable/3559861).

data_uncertainties =
[i => uncertainty(Evidence(i => mean(data[:, i]))) for i in names(data)[1:4]]
@@ -279,7 +284,7 @@ p2 = sticks(

plot(p1, p2; layout = (1, 2), legend = false)

# We can view the posterior distribution $q(y|e_{S}$ conditioned on a state (here arbitrarily set to $S = e_{3}$, giving evidence for $x_{3}$).
# We can view the posterior distribution $q(y|e_{S})$ conditioned on a state (here arbitrarily set to $S = e_{3}$, giving evidence for $x_{3}$).

evidence = Evidence("x3" => mean(data.x3))
plot_weights = StatsBase.weights(weights(evidence))
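As a side note on the λ = 0.5 remark in the tutorial text above, the following self-contained sketch (toy data and made-up variable names, purely illustrative) shows how squared Mahalanobis distances turn into exponential weights that coincide, up to normalization, with a multivariate normal density:

```julia
# Illustrative only: exponential weights w_j = exp(-λ d_j) from squared
# Mahalanobis distances; with λ = 0.5 these are proportional to a multivariate
# normal density whose covariance is estimated from the historical data.
using LinearAlgebra, Statistics

X = randn(100, 3)                  # toy "historical" numeric data, one row per entity
Σ = cov(X)                         # empirical covariance matrix
x̂ = vec(mean(X; dims = 1)) .+ 0.1  # a new, fully observed entity

d = [(row .- x̂)' * inv(Σ) * (row .- x̂) for row in eachrow(X)]  # squared Mahalanobis distances
λ = 0.5
w = exp.(-λ .* d)                  # unnormalized weights
w ./= sum(w)                       # probabilistic weights over historical rows
```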
docs/src/tutorials/SimpleGenerative.md (29 changes: 17 additions & 12 deletions)
@@ -84,20 +84,24 @@ $q(e_{S^{\prime}}|e_{S})$.
If there is no evidence associated with a new entity, we assign $w_{j}$ according to some prior distribution (uniform by default).
Otherwise we use some particular distance and similarity metrics.

For each feature $x\in X$, we consider a function $\rho_x$, which measures the distance between two outputs. By default, we consider:
For each feature $x\in X$, we consider a function $\rho_x$, which measures the distance between two outputs. Please be aware that the "distances" discussed here do not conform to the mathematical definition of "metrics", even though they are functions derived from underlying metrics (i.e., a square of a metric). This is justified when considering how these "distances" are subsequently converted into probabilistic weights.

By default, we consider the following distances:
- Rescaled Kronecker delta (i.e., $\rho(x, y)=0$ only when $x=y$, and $\rho(x, y)= \lambda$ under any other circumstances, with $\lambda > 0$) for discrete features (i.e., features whose types are modeled as `MultiClass` type in [ScientificTypes.jl](https://github.com/JuliaAI/ScientificTypes.jl));
- Rescaled squared distance $\rho(x, y) = λ \frac{(x - y)^2}{2\sigma^2}$, where $\sigma^2$ is the variance of the feature column, estimated with respect to the prior for continuous features.
- Mahalanobis distance $\rho(x,y) = \sqrt{(x-y)^{⊤}\Sigma^{-1}(x-y)}$, where $\Sigma$ is the empirical covariance matrix of the historical data.
- Rescaled ("standardized", by default) squared distance $\rho(x, y) = λ \frac{(x - y)^2}{\sigma^2}$, where $\sigma^2$ is the variance of the feature column, estimated with respect to the prior for continuous features.
- Squared Mahalanobis distance $\rho(x,y) = (x-y)^{⊤}\Sigma^{-1}(x-y)$, where $\Sigma$ is the empirical covariance matrix of the historical data.

Therefore, for distance metrics assuming independence of features (Kronecker delta and squared distance), given the new entity's experimental state with readouts over the feature set $F = \bigcup X_{e}$, where $e \in S$, we can calculate
For distance metrics assuming independence of features (Kronecker delta and squared distance), given the new entity's experimental state with readouts over the feature set $F = \bigcup X_{e}$, where $e \in S$, we can calculate
the distance from the $j$-th historical entity as $d_j = \sum_{x\in F} \rho_x (\hat x, x_j)$, where $\hat x$ and $x_j$ denote the readout
for feature $x$ for the entity being tested and the entity recorded in $j$-th column.
Mahalanobis distance directly takes in "rows", $\rho(\hat{x},x_{j})$.
The squared Mahalanobis distance directly takes in "rows", $\rho(\hat{x},x_{j})$.

Next, we convert distances $d_j$ into probabilistic weights $w_j$. By default, we use a rescaled exponential function, i.e.,
we put $w_j = \exp(-\lambda d_j)$ for some $\lambda>0$. Notably, $\lambda$'s value determines how belief is distributed across the historical entities.
we put $w_j = \exp(-\lambda d_j)$ with $\lambda=0.5$. Notably, $\lambda$'s value determines how belief is distributed across the historical entities.
Larger values of $\lambda$ concentrate the belief tightly around the "closest" historical entities, while smaller values distribute more belief to more distant entities.

Note that by choosing the squared Mahalanobis distance and the exponential function with a factor of $\lambda=0.5$, the resulting weight effectively equals the density of a [multivariate normal distribution](https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Density_function) fitted to the historical data. A similar assertion applies when we use the sum of "standardized" squared distances instead. However, in this latter case, we "enforce" the independence of the features.

The proper choice of distance and similarity metrics depends on insight into the dataset at hand. Weights can then be used to construct
weighted histograms or density estimators for the posterior distributions of interest, or to directly resample historical rows.

@@ -243,7 +247,7 @@ returns three functions needed by CEEDesigns.jl:
- `weights`: this represents a function of `evidence` that generates probability weights $w_j$ to each row in the historical data.

By default, Euclidean distance is used as the distance metric. In the second
call to `DistanceBased`, we instead use the Mahalanobis distance.
call to `DistanceBased`, we instead use the squared Mahalanobis distance.
It is possible to specify different distance metrics for each feature, see our
[heart disease generative modeling](GenerativeDesigns.md) tutorial for more information.
In both cases, the squared exponential function is used to convert distances
@@ -262,15 +266,16 @@ to weights.
target = "y",
uncertainty = Variance(),
similarity = GenerativeDesigns.Exponential(; λ = 5),
distance = MahalanobisDistance(; diagonal = 0),
distance = SquaredMahalanobisDistance(; diagonal = 0),
);
nothing #hide
````

We can look at the uncertainty in $y$ for a state where a single
feature is "observed" at its mean value. Note that uncertainty is generally higher
for the Mahalanobis distance, which makes sense as it takes into account the
non-independence of the features.
feature is "observed" at its mean value. Note that uncertainty is generally lower for the squared Mahalanobis distance.
As we consider only a single non-missing entry, this makes sense as the implemented variant of the squared Mahalanobis distance for missing values effectively multiplies
the other, quadratic distance, by a factor greater than one. For more details, refer to page 285 of
[Multivariate outlier detection in incomplete survey data: The epidemic algorithm and transformed rank correlations](https://www.jstor.org/stable/3559861).

````@example SimpleGenerative
data_uncertainties =
@@ -299,7 +304,7 @@ p2 = sticks(
plot(p1, p2; layout = (1, 2), legend = false)
````

We can view the posterior distribution $q(y|e_{S}$ conditioned on a state (here arbitrarily set to $S = e_{3}$, giving evidence for $x_{3}$).
We can view the posterior distribution $q(y|e_{S})$ conditioned on a state (here arbitrarily set to $S = e_{3}$, giving evidence for $x_{3}$).

````@example SimpleGenerative
evidence = Evidence("x3" => mean(data.x3))
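To make the remark above about resampling historical rows concrete, here is a brief sketch of assumed usage. It reuses `data`, `weights`, and `evidence` from the tutorial's own setup; the call pattern is illustrative rather than prescribed by the package:

```julia
# Illustrative sketch: approximate the posterior q(y | e_S) by resampling
# historical rows in proportion to their probabilistic weights.
using DataFrames, StatsBase

w = StatsBase.weights(weights(evidence))              # weights for the current evidence
idx = sample(1:nrow(data), w, 5_000; replace = true)  # weighted resampling of row indices
posterior_y = data.y[idx]                             # weighted posterior sample of the target
```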
docs/src/tutorials/SimpleStatic.jl (4 changes: 2 additions & 2 deletions)
@@ -59,9 +59,9 @@
# ### Optimal Arrangements

# To find the optimal arrangement for each $S$ we need to know the cost of $O_{S}$. The monetary cost of $O_{S}$ is simply
# the sum of the costs of each experiment:$m_{O_{S}}=\sum_{e\in S} m_{e}$.
# the sum of the costs of each experiment: $m_{O_{S}}=\sum_{e\in S} m_{e}$.
# The total time required is the sum of the maximum time *of each partition*. This is because while each partition in the
# arrangement is done in serial, experiments within partitions are done in parallel. That is, $t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e}: e \in o_{i}\}$
# arrangement is done in serial, experiments within partitions are done in parallel. That is, $t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e}: e \in o_{i}\}$.
# Given these costs and a parameter $\lambda$ which controls the tradeoff between monetary cost and time, the combined
# cost of an arrangement is: $\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$.
#
docs/src/tutorials/SimpleStatic.md (4 changes: 2 additions & 2 deletions)
@@ -63,9 +63,9 @@ $l=1$. Other arrangements fall between these extremes.
### Optimal Arrangements

To find the optimal arrangement for each $S$ we need to know the cost of $O_{S}$. The monetary cost of $O_{S}$ is simply
the sum of the costs of each experiment:$m_{O_{S}}=\sum_{e\in S} m_{e}$.
the sum of the costs of each experiment: $m_{O_{S}}=\sum_{e\in S} m_{e}$.
The total time required is the sum of the maximum time *of each partition*. This is because while each partition in the
arrangement is done in serial, experiments within partitions are done in parallel. That is, $t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e}: e \in o_{i}\}$
arrangement is done in serial, experiments within partitions are done in parallel. That is, $t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e}: e \in o_{i}\}$.
Given these costs and a parameter $\lambda$ which controls the tradeoff between monetary cost and time, the combined
cost of an arrangement is: $\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$.

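To make the arrangement-cost formulas above concrete, here is a tiny worked sketch; the helper functions and toy numbers are illustrative and not part of the package API:

```julia
# Illustrative sketch: combined cost of an arrangement O_S in which experiments
# within a partition run in parallel and partitions run in series.
monetary(o, m) = sum(m[e] for part in o for e in part)           # Σ m_e over all experiments
duration(o, t) = sum(maximum(t[e] for e in part) for part in o)  # Σ of per-partition maxima
combined(o, m, t; λ = 0.5) = λ * monetary(o, m) + (1 - λ) * duration(o, t)

m = Dict("e1" => 1.0, "e2" => 2.0, "e3" => 4.0)  # toy monetary costs
t = Dict("e1" => 1.0, "e2" => 3.0, "e3" => 2.0)  # toy durations
o = [["e1", "e2"], ["e3"]]                       # e1 and e2 in parallel, then e3
combined(o, m, t; λ = 0.5)                       # 0.5 * 7.0 + 0.5 * 5.0 = 6.0
```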
src/GenerativeDesigns/GenerativeDesigns.jl (2 changes: 1 addition & 1 deletion)
@@ -14,7 +14,7 @@ using MCTS
using ..CEEDesigns: front

export UncertaintyReductionMDP, DistanceBased
export QuadraticDistance, DiscreteDistance, MahalanobisDistance
export QuadraticDistance, DiscreteDistance, SquaredMahalanobisDistance
export Exponential
export Variance, Entropy
export Evidence, State