Fixed Mahalanobis distance, tutorials (#12)
- Implemented a variant of the squared Mahalanobis distance for data with missing entries; see p. 285 of https://www.jstor.org/stable/3559861 (fixes #12)
- Renamed `MahalanobisDistance` to `SquaredMahalanobisDistance`
- Minor adjustments to tutorials
- Removed extra dependencies
- Incremented version

Fixes #11 and #12
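For readers unfamiliar with the referenced variant, here is a minimal illustrative sketch of the idea, not the package's actual implementation: the quadratic form is evaluated over the coordinates observed in both rows and then rescaled. The helper name and the p/q rescaling factor are assumptions based on the cited reference.

```julia
# Hypothetical sketch (not CEEDesigns' code): squared Mahalanobis distance
# between two rows that may contain `missing` entries, following the idea of
# rescaling the partial distance discussed on p. 285 of the cited paper.
using LinearAlgebra

function sq_mahalanobis_missing(x̂, x, Σ)
    # coordinates observed in both rows
    obs = [i for i in eachindex(x̂) if !ismissing(x̂[i]) && !ismissing(x[i])]
    isempty(obs) && return 0.0
    Δ = Float64.(x̂[obs] .- x[obs])
    d² = Δ' * inv(Σ[obs, obs]) * Δ           # partial squared Mahalanobis distance
    return (length(x̂) / length(obs)) * d²    # rescale by p/q ≥ 1 (assumed variant)
end
```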
thevolatilebit committed Dec 3, 2023
1 parent b50a3cb commit 85c17c1
Showing 16 changed files with 116 additions and 91 deletions.
Project.toml (10 changes: 1 addition & 9 deletions)
@@ -1,18 +1,14 @@
name = "CEEDesigns"
uuid = "e939450b-799e-4198-a5f5-3f2f7fb1c671"
version = "0.3.5"
version = "0.3.6"

[deps]
Clustering = "aaaa29a8-35af-508c-8bc3-b662a17a0fe5"
Combinatorics = "861a8166-3701-5b0c-9a16-15d98fcdc6aa"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
LibPQ = "194296ae-ab2e-5f79-8cd4-7183a0a5a0d1"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
MCTS = "e12ccd36-dcad-5f33-8774-9175229e7b33"
MLJ = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
POMDPSimulators = "e0d0a172-29c6-5d4e-96d0-f262df5d01fd"
POMDPTools = "7588e00f-9cae-40de-98dc-e0c70c48cdd7"
POMDPs = "a93abf59-7444-517b-a68a-c42f96afdd7d"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
@@ -24,16 +20,12 @@ Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"

[compat]
Clustering = "0.15"
Combinatorics = "1.0"
DataFrames = "1.6"
HTTP = "1.10"
JSON = "0.21"
LibPQ = "1.17"
LinearAlgebra = "1.9"
MCTS = "0.5"
MLJ = "0.20"
POMDPSimulators = "0.3"
POMDPTools = "0.1"
POMDPs = "0.9"
Plots = "1.39"
docs/src/api.md (2 changes: 1 addition & 1 deletion)
@@ -33,7 +33,7 @@ CEEDesigns.GenerativeDesigns.efficient_value
CEEDesigns.GenerativeDesigns.DistanceBased
CEEDesigns.GenerativeDesigns.QuadraticDistance
CEEDesigns.GenerativeDesigns.DiscreteDistance
CEEDesigns.GenerativeDesigns.MahalanobisDistance
CEEDesigns.GenerativeDesigns.SquaredMahalanobisDistance
CEEDesigns.GenerativeDesigns.Exponential
```

docs/src/tutorials/GenerativeDesigns.jl (4 changes: 2 additions & 2 deletions)
@@ -74,14 +74,14 @@ DistanceBased(
),
);

# You can also use the Mahalanobis distance (`MahalanobisDistance(; diagonal)`). As the Mahalanobis distance only works with numeric features, we have to select a few, along with the target variable. For example, we could write:
# You can also use the squared Mahalanobis distance (`SquaredMahalanobisDistance(; diagonal)`). As the squared Mahalanobis distance only works with numeric features, we have to select a few, along with the target variable. For example, we could write:

DistanceBased(
data[!, ["RestingBP", "MaxHR", "Cholesterol", "FastingBS", "HeartDisease"]];
target = "HeartDisease",
uncertainty = Entropy(),
similarity = Exponential(; λ = 5),
distance = MahalanobisDistance(; diagonal = 1),
distance = SquaredMahalanobisDistance(; diagonal = 1),
);

# The package offers an additional flexibility by allowing an experiment to yield readouts over multiple features at the same time. In our scenario, we can consider the features `RestingECG`, `Oldpeak`, `ST_Slope`, and `MaxHR` to be obtained from a single experiment `ECG`.
docs/src/tutorials/GenerativeDesigns.md (4 changes: 2 additions & 2 deletions)
@@ -91,15 +91,15 @@ DistanceBased(
nothing #hide
````

You can also use the Mahalanobis distance (`MahalanobisDistance(; diagonal)`). As the Mahalanobis distance only works with numeric features, we have to select a few, along with the target variable. For example, we could write:
You can also use the squared Mahalanobis distance (`SquaredMahalanobisDistance(; diagonal)`). As the squared Mahalanobis distance only works with numeric features, we have to select a few, along with the target variable. For example, we could write:

````@example GenerativeDesigns
DistanceBased(
data[!, ["RestingBP", "MaxHR", "Cholesterol", "FastingBS", "HeartDisease"]];
target = "HeartDisease",
uncertainty = Entropy(),
similarity = Exponential(; λ = 5),
distance = MahalanobisDistance(; diagonal = 1),
distance = SquaredMahalanobisDistance(; diagonal = 1),
);
nothing #hide
````
docs/src/tutorials/SimpleGenerative.jl (31 changes: 18 additions & 13 deletions)
@@ -80,19 +80,23 @@
# If there is no evidence associated with a new entity, we assign $w_{j}$ according to some prior distribution (uniform by default).
# Otherwise we use some particular distance and similarity metrics.
#
# For each feature $x\in X$, we consider a function $\rho_x$, which measures the distance between two outputs. By default, we consider:
# For each feature $x\in X$, we consider a function $\rho_x$, which measures the distance between two outputs. Please be aware that the "distances" discussed here do not conform to the mathematical definition of "metrics", even though they are functions derived from underlying metrics (i.e., a square of a metric). This is justified when considering how these "distances" are subsequently converted into probabilistic weights.
#
# By default, we consider the following distances:
# - Rescaled Kronecker delta (i.e., $\rho(x, y)=0$ only when $x=y$, and $\rho(x, y)= \lambda$ under any other circumstances, with $\lambda > 0$) for discrete features (i.e., features whose types are modeled as `MultiClass` type in [ScientificTypes.jl](https://github.com/JuliaAI/ScientificTypes.jl));
# - Rescaled squared distance $\rho(x, y) = λ \frac{(x - y)^2}{2\sigma^2}$, where $\sigma^2$ is the variance of the feature column, estimated with respect to the prior for continuous features.
# - Mahalanobis distance $\rho(x,y) = \sqrt{(x-y)^{⊤}\Sigma^{-1}(x-y)}$, where $\Sigma$ is the empirical covariance matrix of the historical data.
# - Rescaled ("standardized", by default) squared distance $\rho(x, y) = λ \frac{(x - y)^2}{\sigma^2}$, where $\sigma^2$ is the variance of the feature column, estimated with respect to the prior for continuous features.
# - Squared Mahalanobis distance $\rho(x,y) = (x-y)^{⊤}\Sigma^{-1}(x-y)$, where $\Sigma$ is the empirical covariance matrix of the historical data.
#
# Therefore, for distance metrics assuming independence of features (Kronecker delta and squared distance), given the new entity's experimental state with readouts over the feature set $F = \bigcup X_{e}$, where $e \in S$, we can calculate
# For distance metrics assuming independence of features (Kronecker delta and squared distance), given the new entity's experimental state with readouts over the feature set $F = \bigcup X_{e}$, where $e \in S$, we can calculate
# the distance from the $j$-th historical entity as $d_j = \sum_{x\in F} \rho_x (\hat x, x_j)$, where $\hat x$ and $x_j$ denote the readout
# for feature $x$ for the entity being tested and the entity recorded in $j$-th column.
# Mahalanobis distance directly takes in "rows", $\rho(\hat{x},x_{j})$.
#
# The squared Mahalanobis distance directly takes in "rows", $\rho(\hat{x},x_{j})$.
#
# Next, we convert distances $d_j$ into probabilistic weights $w_j$. By default, we use a rescaled exponential function, i.e.,
# we put $w_j = \exp(-\lambda d_j)$ for some $\lambda>0$. Notably, $\lambda$'s value determines how belief is distributed across the historical entities.
# we put $w_j = \exp(-\lambda d_j)$ with $\lambda=0.5$. Notably, $\lambda$'s value determines how belief is distributed across the historical entities.
# Larger values of $\lambda$ concentrate the belief tightly around the "closest" historical entities, while smaller values distribute more belief to more distant entities.
#
# Note that by choosing the squared Mahalanobis distance and the exponential function with a factor of $\lambda=0.5$, the resulting weight effectively equals the density of a [multivariate normal distribution](https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Density_function) fitted to the historical data. A similar assertion applies when we use the sum of "standardized" squared distances instead. However, in this latter case, we "enforce" the independence of the features.
#
# The proper choice of distance and similarity metrics depends on insight into the dataset at hand. Weights can then be used to construct
# weighted histograms or density estimators for the posterior distributions of interest, or to directly resample historical rows.
@@ -228,7 +232,7 @@ plot(p1, p2, p3, p4; layout = (2, 2), legend = false)
# - `weights`: this represents a function of `evidence` that generates probability weights $w_j$ to each row in the historical data.
#
# By default, Euclidean distance is used as the distance metric. In the second
# call to `DistanceBased`, we instead use the Mahalanobis distance.
# call to `DistanceBased`, we instead use the squared Mahalanobis distance.
# It is possible to specify different distance metrics for each feature, see our
# [heart disease generative modeling](GenerativeDesigns.md) tutorial for more information.
# In both cases, the squared exponential function is used to convert distances
@@ -246,13 +250,14 @@ plot(p1, p2, p3, p4; layout = (2, 2), legend = false)
target = "y",
uncertainty = Variance(),
similarity = GenerativeDesigns.Exponential(; λ = 5),
distance = MahalanobisDistance(; diagonal = 0),
distance = SquaredMahalanobisDistance(; diagonal = 0),
);

# We can look at the uncertainty in $y$ for a state where a single
# feature is "observed" at its mean value. Note that uncertainty is generally higher
# for the Mahalanobis distance, which makes sense as it takes into account the
# non-independence of the features.
# feature is "observed" at its mean value. Note that uncertainty is generally lower for the squared Mahalanobis distance.
# As we consider only a single non-missing entry, this makes sense as the implemented variant of the squared Mahalanobis distance for missing values effectively multiplies
# the other, quadratic distance, by a factor greater than one. For more details, refer to page 285 of
# [Multivariate outlier detection in incomplete survey data: The epidemic algorithm and transformed rank correlations](https://www.jstor.org/stable/3559861).

data_uncertainties =
[i => uncertainty(Evidence(i => mean(data[:, i]))) for i in names(data)[1:4]]
@@ -279,7 +284,7 @@ p2 = sticks(

plot(p1, p2; layout = (1, 2), legend = false)

# We can view the posterior distribution $q(y|e_{S}$ conditioned on a state (here arbitrarily set to $S = e_{3}$, giving evidence for $x_{3}$).
# We can view the posterior distribution $q(y|e_{S})$ conditioned on a state (here arbitrarily set to $S = e_{3}$, giving evidence for $x_{3}$).

evidence = Evidence("x3" => mean(data.x3))
plot_weights = StatsBase.weights(weights(evidence))
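As a side note on the λ = 0.5 remark in the tutorial text above, the following self-contained sketch (toy data and made-up variable names, purely illustrative) shows how squared Mahalanobis distances turn into exponential weights that coincide, up to normalization, with a multivariate normal density:

```julia
# Illustrative only: exponential weights w_j = exp(-λ d_j) from squared
# Mahalanobis distances; with λ = 0.5 these are proportional to a multivariate
# normal density whose covariance is estimated from the historical data.
using LinearAlgebra, Statistics

X = randn(100, 3)                  # toy "historical" numeric data, one row per entity
Σ = cov(X)                         # empirical covariance matrix
x̂ = vec(mean(X; dims = 1)) .+ 0.1  # a new, fully observed entity

d = [(row .- x̂)' * inv(Σ) * (row .- x̂) for row in eachrow(X)]  # squared Mahalanobis distances
λ = 0.5
w = exp.(-λ .* d)                  # unnormalized weights
w ./= sum(w)                       # probabilistic weights over historical rows
```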
docs/src/tutorials/SimpleGenerative.md (29 changes: 17 additions & 12 deletions)
@@ -84,20 +84,24 @@ $q(e_{S^{\prime}}|e_{S})$.
If there is no evidence associated with a new entity, we assign $w_{j}$ according to some prior distribution (uniform by default).
Otherwise we use some particular distance and similarity metrics.

For each feature $x\in X$, we consider a function $\rho_x$, which measures the distance between two outputs. By default, we consider:
For each feature $x\in X$, we consider a function $\rho_x$, which measures the distance between two outputs. Please be aware that the "distances" discussed here do not conform to the mathematical definition of "metrics", even though they are functions derived from underlying metrics (i.e., a square of a metric). This is justified when considering how these "distances" are subsequently converted into probabilistic weights.

By default, we consider the following distances:
- Rescaled Kronecker delta (i.e., $\rho(x, y)=0$ only when $x=y$, and $\rho(x, y)= \lambda$ under any other circumstances, with $\lambda > 0$) for discrete features (i.e., features whose types are modeled as `MultiClass` type in [ScientificTypes.jl](https://github.com/JuliaAI/ScientificTypes.jl));
- Rescaled squared distance $\rho(x, y) = λ \frac{(x - y)^2}{2\sigma^2}$, where $\sigma^2$ is the variance of the feature column, estimated with respect to the prior for continuous features.
- Mahalanobis distance $\rho(x,y) = \sqrt{(x-y)^{⊤}\Sigma^{-1}(x-y)}$, where $\Sigma$ is the empirical covariance matrix of the historical data.
- Rescaled ("standardized", by default) squared distance $\rho(x, y) = λ \frac{(x - y)^2}{\sigma^2}$, where $\sigma^2$ is the variance of the feature column, estimated with respect to the prior for continuous features.
- Squared Mahalanobis distance $\rho(x,y) = (x-y)^{⊤}\Sigma^{-1}(x-y)$, where $\Sigma$ is the empirical covariance matrix of the historical data.

Therefore, for distance metrics assuming independence of features (Kronecker delta and squared distance), given the new entity's experimental state with readouts over the feature set $F = \bigcup X_{e}$, where $e \in S$, we can calculate
For distance metrics assuming independence of features (Kronecker delta and squared distance), given the new entity's experimental state with readouts over the feature set $F = \bigcup X_{e}$, where $e \in S$, we can calculate
the distance from the $j$-th historical entity as $d_j = \sum_{x\in F} \rho_x (\hat x, x_j)$, where $\hat x$ and $x_j$ denote the readout
for feature $x$ for the entity being tested and the entity recorded in $j$-th column.
Mahalanobis distance directly takes in "rows", $\rho(\hat{x},x_{j})$.
The squared Mahalanobis distance directly takes in "rows", $\rho(\hat{x},x_{j})$.

Next, we convert distances $d_j$ into probabilistic weights $w_j$. By default, we use a rescaled exponential function, i.e.,
we put $w_j = \exp(-\lambda d_j)$ for some $\lambda>0$. Notably, $\lambda$'s value determines how belief is distributed across the historical entities.
we put $w_j = \exp(-\lambda d_j)$ with $\lambda=0.5$. Notably, $\lambda$'s value determines how belief is distributed across the historical entities.
Larger values of $\lambda$ concentrate the belief tightly around the "closest" historical entities, while smaller values distribute more belief to more distant entities.

Note that by choosing the squared Mahalanobis distance and the exponential function with a factor of $\lambda=0.5$, the resulting weight effectively equals the density of a [multivariate normal distribution](https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Density_function) fitted to the historical data. A similar assertion applies when we use the sum of "standardized" squared distances instead. However, in this latter case, we "enforce" the independence of the features.

The proper choice of distance and similarity metrics depends on insight into the dataset at hand. Weights can then be used to construct
weighted histograms or density estimators for the posterior distributions of interest, or to directly resample historical rows.

@@ -243,7 +247,7 @@ returns three functions needed by CEEDesigns.jl:
- `weights`: this represents a function of `evidence` that generates probability weights $w_j$ to each row in the historical data.

By default, Euclidean distance is used as the distance metric. In the second
call to `DistanceBased`, we instead use the Mahalanobis distance.
call to `DistanceBased`, we instead use the squared Mahalanobis distance.
It is possible to specify different distance metrics for each feature, see our
[heart disease generative modeling](GenerativeDesigns.md) tutorial for more information.
In both cases, the squared exponential function is used to convert distances
@@ -262,15 +266,16 @@ to weights.
target = "y",
uncertainty = Variance(),
similarity = GenerativeDesigns.Exponential(; λ = 5),
distance = MahalanobisDistance(; diagonal = 0),
distance = SquaredMahalanobisDistance(; diagonal = 0),
);
nothing #hide
````

We can look at the uncertainty in $y$ for a state where a single
feature is "observed" at its mean value. Note that uncertainty is generally higher
for the Mahalanobis distance, which makes sense as it takes into account the
non-independence of the features.
feature is "observed" at its mean value. Note that uncertainty is generally lower for the squared Mahalanobis distance.
As we consider only a single non-missing entry, this makes sense as the implemented variant of the squared Mahalanobis distance for missing values effectively multiplies
the other, quadratic distance, by a factor greater than one. For more details, refer to page 285 of
[Multivariate outlier detection in incomplete survey data: The epidemic algorithm and transformed rank correlations](https://www.jstor.org/stable/3559861).

````@example SimpleGenerative
data_uncertainties =
@@ -299,7 +304,7 @@ p2 = sticks(
plot(p1, p2; layout = (1, 2), legend = false)
````

We can view the posterior distribution $q(y|e_{S}$ conditioned on a state (here arbitrarily set to $S = e_{3}$, giving evidence for $x_{3}$).
We can view the posterior distribution $q(y|e_{S})$ conditioned on a state (here arbitrarily set to $S = e_{3}$, giving evidence for $x_{3}$).

````@example SimpleGenerative
evidence = Evidence("x3" => mean(data.x3))
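To make the remark above about resampling historical rows concrete, here is a brief sketch of assumed usage. It reuses `data`, `weights`, and `evidence` from the tutorial's own setup; the call pattern is illustrative rather than prescribed by the package:

```julia
# Illustrative sketch: approximate the posterior q(y | e_S) by resampling
# historical rows in proportion to their probabilistic weights.
using DataFrames, StatsBase

w = StatsBase.weights(weights(evidence))              # weights for the current evidence
idx = sample(1:nrow(data), w, 5_000; replace = true)  # weighted resampling of row indices
posterior_y = data.y[idx]                             # weighted posterior sample of the target
```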
docs/src/tutorials/SimpleStatic.jl (4 changes: 2 additions & 2 deletions)
@@ -59,9 +59,9 @@
# ### Optimal Arrangements

# To find the optimal arrangement for each $S$ we need to know the cost of $O_{S}$. The monetary cost of $O_{S}$ is simply
# the sum of the costs of each experiment:$m_{O_{S}}=\sum_{e\in S} m_{e}$.
# the sum of the costs of each experiment: $m_{O_{S}}=\sum_{e\in S} m_{e}$.
# The total time required is the sum of the maximum time *of each partition*. This is because while each partition in the
# arrangement is done in serial, experiments within partitions are done in parallel. That is, $t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e}: e \in o_{i}\}$
# arrangement is done in serial, experiments within partitions are done in parallel. That is, $t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e}: e \in o_{i}\}$.
# Given these costs and a parameter $\lambda$ which controls the tradeoff between monetary cost and time, the combined
# cost of an arrangement is: $\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$.
#
docs/src/tutorials/SimpleStatic.md (4 changes: 2 additions & 2 deletions)
@@ -63,9 +63,9 @@ $l=1$. Other arrangements fall between these extremes.
### Optimal Arrangements

To find the optimal arrangement for each $S$ we need to know the cost of $O_{S}$. The monetary cost of $O_{S}$ is simply
the sum of the costs of each experiment:$m_{O_{S}}=\sum_{e\in S} m_{e}$.
the sum of the costs of each experiment: $m_{O_{S}}=\sum_{e\in S} m_{e}$.
The total time required is the sum of the maximum time *of each partition*. This is because while each partition in the
arrangement is done in serial, experiments within partitions are done in parallel. That is, $t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e}: e \in o_{i}\}$
arrangement is done in serial, experiments within partitions are done in parallel. That is, $t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e}: e \in o_{i}\}$.
Given these costs and a parameter $\lambda$ which controls the tradeoff between monetary cost and time, the combined
cost of an arrangement is: $\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$.

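To make the arrangement-cost formulas above concrete, here is a tiny worked sketch; the helper functions and toy numbers are illustrative and not part of the package API:

```julia
# Illustrative sketch: combined cost of an arrangement O_S in which experiments
# within a partition run in parallel and partitions run in series.
monetary(o, m) = sum(m[e] for part in o for e in part)           # Σ m_e over all experiments
duration(o, t) = sum(maximum(t[e] for e in part) for part in o)  # Σ of per-partition maxima
combined(o, m, t; λ = 0.5) = λ * monetary(o, m) + (1 - λ) * duration(o, t)

m = Dict("e1" => 1.0, "e2" => 2.0, "e3" => 4.0)  # toy monetary costs
t = Dict("e1" => 1.0, "e2" => 3.0, "e3" => 2.0)  # toy durations
o = [["e1", "e2"], ["e3"]]                       # e1 and e2 in parallel, then e3
combined(o, m, t; λ = 0.5)                       # 0.5 * 7.0 + 0.5 * 5.0 = 6.0
```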
src/GenerativeDesigns/GenerativeDesigns.jl (2 changes: 1 addition & 1 deletion)
@@ -14,7 +14,7 @@ using MCTS
using ..CEEDesigns: front

export UncertaintyReductionMDP, DistanceBased
export QuadraticDistance, DiscreteDistance, MahalanobisDistance
export QuadraticDistance, DiscreteDistance, SquaredMahalanobisDistance
export Exponential
export Variance, Entropy
export Evidence, State