Improve documentation and introduce simple synthetic examples (#4)

Improved documentation and added multiple fixes.

- Changed the way that an "uncertainty" is provided. That is, while "distances" and "similarities" are provided by a user through a function call, an "uncertainty" was originally given as a type. (#13)
- Optimized computations in `distance based.jl` (#11)
- Implemented a variant of sq. Mahalanobis distance with missing entries, see https://www.jstor.org/stable/3559861 on page 285, fixes #12
- Renamed `MahalanobisDistance` to `SquaredMahalanobisDistance`

Fixes #11, #12, and #13

---------

Co-authored-by: Bíma, Jan <jan.bima@merck.com>
Merck · Dec 3, 2023 · 87ac18c
1 parent d1d09c3
Showing 51 changed files with 4,660 additions and 491 deletions.
diff --git a/Project.toml b/Project.toml
@@ -1,19 +1,14 @@
 name = "CEEDesigns"
 uuid = "e939450b-799e-4198-a5f5-3f2f7fb1c671"
-version = "0.3.5"
+version = "0.3.6"
 
 [deps]
-Clustering = "aaaa29a8-35af-508c-8bc3-b662a17a0fe5"
 Combinatorics = "861a8166-3701-5b0c-9a16-15d98fcdc6aa"
 DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
-Distances = "b4f34e82-e78d-54a5-968a-f98e89d6e8f7"
-HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
 JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
-LibPQ = "194296ae-ab2e-5f79-8cd4-7183a0a5a0d1"
 LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
 MCTS = "e12ccd36-dcad-5f33-8774-9175229e7b33"
 MLJ = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
-POMDPSimulators = "e0d0a172-29c6-5d4e-96d0-f262df5d01fd"
 POMDPTools = "7588e00f-9cae-40de-98dc-e0c70c48cdd7"
 POMDPs = "a93abf59-7444-517b-a68a-c42f96afdd7d"
 Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
@@ -25,24 +20,19 @@ Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
 StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
 
 [compat]
-julia = "1.9"
-Plots = "1.39"
-ScientificTypes = "3.0"
-POMDPTools = "0.1"
-DataFrames = "1.6"
-HTTP = "1.10"
-LinearAlgebra = "1.9"
-LibPQ = "1.17"
 Combinatorics = "1.0"
-Statistics = "1.9"
-Random = "1.9"
-Reexport = "1.2"
-Distances = "0.10"
-POMDPs = "0.9"
+DataFrames = "1.6"
 JSON = "0.21"
-Clustering = "0.15"
+LinearAlgebra = "1.9"
 MCTS = "0.5"
 MLJ = "0.20"
+POMDPTools = "0.1"
+POMDPs = "0.9"
+Plots = "1.39"
+Random = "1.9"
+Reexport = "1.2"
 Requires = "1.3"
-POMDPSimulators = "0.3"
+ScientificTypes = "3.0"
+Statistics = "1.9"
 StatsBase = "0.34"
+julia = "1.9"
diff --git a/docs/Project.toml b/docs/Project.toml
@@ -2,14 +2,21 @@
 BetaML = "024491cd-cc6b-443e-8034-08ea7eb7db2b"
 CEEDesigns = "e939450b-799e-4198-a5f5-3f2f7fb1c671"
 CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
+Combinatorics = "861a8166-3701-5b0c-9a16-15d98fcdc6aa"
+Copulas = "ae264745-0b69-425e-9d9d-cf662c5eec93"
 D3Trees = "e3df1716-f71e-5df9-9e2d-98e193103c45"
 DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
+Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
 Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
 DocumenterMarkdown = "997ab1e6-3595-5248-9280-8efb232c3433"
 Literate = "98b081ad-f1c9-55d3-8b20-4c87d4299306"
 MCTS = "e12ccd36-dcad-5f33-8774-9175229e7b33"
 MLJ = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
 MLJModels = "d491faf4-2d78-11e9-2867-c94bc002c0b7"
+POMDPTools = "7588e00f-9cae-40de-98dc-e0c70c48cdd7"
+POMDPs = "a93abf59-7444-517b-a68a-c42f96afdd7d"
 Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
 ScientificTypes = "321657f4-b219-11e9-178b-2701a2544e81"
+Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
 StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
+StatsPlots = "f3b207a7-027a-5e70-b257-86293d7955fd"
diff --git a/docs/make.jl b/docs/make.jl
@@ -3,8 +3,13 @@ using CEEDesigns
 
 # Literate for tutorials
 const literate_dir = joinpath(@__DIR__, "..", "tutorials")
-const tutorials_src =
-    ["StaticDesigns.jl", "StaticDesignsFiltration.jl", "GenerativeDesigns.jl"]
+const tutorials_src = [
+    "SimpleStatic.jl",
+    "SimpleGenerative.jl",
+    "StaticDesigns.jl",
+    "StaticDesignsFiltration.jl",
+    "GenerativeDesigns.jl",
+]
 const generated_dir = joinpath(@__DIR__, "src", "tutorials/")
 
 # copy tutorials src
@@ -29,6 +34,8 @@ end
 pages = [
     "index.md",
     "Tutorials" => [
+        "tutorials/SimpleStatic.md",
+        "tutorials/SimpleGenerative.md",
         "tutorials/StaticDesigns.md",
         "tutorials/StaticDesignsFiltration.md",
         "tutorials/GenerativeDesigns.md",

diff --git a/docs/src/api.md b/docs/src/api.md
@@ -33,7 +33,7 @@ CEEDesigns.GenerativeDesigns.efficient_value
 CEEDesigns.GenerativeDesigns.DistanceBased
 CEEDesigns.GenerativeDesigns.QuadraticDistance
 CEEDesigns.GenerativeDesigns.DiscreteDistance
-CEEDesigns.GenerativeDesigns.MahalanobisDistance
+CEEDesigns.GenerativeDesigns.SquaredMahalanobisDistance
 CEEDesigns.GenerativeDesigns.Exponential
 ```
 

diff --git a/docs/src/index.md b/docs/src/index.md
@@ -7,11 +7,11 @@ A decision-making framework for the cost-efficient design of experiments, balanc
 ```
 
 ## Static experimental designs
-Here we assume that the same experimental design will be used for a population of examined entities, hence the word 'static'.
+Here we assume that the same experimental design will be used for a population of examined entities, hence the word "static".
 
 For each subset of experiments, we consider an estimate of the value of acquired information. To give an example, if a set of experiments is used to predict the value of a specific target variable, our framework can leverage a built-in integration with [MLJ.jl](https://github.com/alan-turing-institute/MLJ.jl) to estimate predictive accuracies of machine learning models fitted over subset of experimental features.
 
-In the cost-sensitive setting of CEEDesigns, a user provides the monetary cost and execution time of each experiment. Given the constraint on the maximum number of parallel experiments along with a fixed tradeoff between monetary cost and execution time, we devise an arrangement of each subset of experiments such that the expected combined cost is minimized.
+In the cost-sensitive setting of `CEEDesigns.jl`, a user provides the monetary cost and execution time of each experiment. Given the constraint on the maximum number of parallel experiments along with a fixed tradeoff between monetary cost and execution time, we devise an arrangement of each subset of experiments such that the expected combined cost is minimized.
 
 Assuming the information values and optimized experimental costs for each subset of experiments, we then generate a set of cost-efficient experimental designs.
 
@@ -23,7 +23,7 @@ Assuming the information values and optimized experimental costs for each subset
 
 We consider 'personalized' experimental designs that dynamically adjust based on the evidence gathered from the experiments. This approach is motivated by the fact that the value of information collected from an experiment generally differs across subpopulations of the entities involved in the triage process.
 
-At the beginning of the triage process, an entity's prior data is used to project a range of cost-efficient experimental designs. Internally, while constructing these designs, we incorporate multiple-step-ahead lookups to model likely experimental outcomes and consider the subsequent decisions for each outcome. Then after choosing a specific decision policy from this set and acquiring additional experimental readouts (sampled from a generative model, hence the word 'generative'), we adjust the continuation based on this evidence.
+At the beginning of the triage process, an entity's prior data is used to project a range of cost-efficient experimental designs. Internally, while constructing these designs, we incorporate multiple-step-ahead lookups to model likely experimental outcomes and consider the subsequent decisions for each outcome. Then after choosing a specific decision policy from this set and acquiring additional experimental readouts (sampled from a generative model, hence the word "generative"), we adjust the continuation based on this evidence.
 
 ```@raw html
 <a><img src="assets/search_tree.png" align="left" alt="code" width="400"></a>