Skip to content

Commit

Permalink
text for SimpleStatic
Browse files Browse the repository at this point in the history
  • Loading branch information
slwu89 committed Nov 23, 2023
1 parent 6e436a1 commit a522106
Show file tree
Hide file tree
Showing 15 changed files with 1,417 additions and 37 deletions.
3 changes: 2 additions & 1 deletion docs/make.jl
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ using CEED
# Literate for tutorials
const literate_dir = joinpath(@__DIR__, "..", "tutorials")
const tutorials_src =
["StaticDesigns.jl", "StaticDesignsFiltration.jl", "GenerativeDesigns.jl"]
["SimpleStatic.jl", "StaticDesigns.jl", "StaticDesignsFiltration.jl", "GenerativeDesigns.jl"]
const generated_dir = joinpath(@__DIR__, "src", "tutorials/")

# copy tutorials src
Expand All @@ -29,6 +29,7 @@ end
pages = [
"index.md",
"Tutorials" => [
"tutorials/SimpleStatic.md",
"tutorials/StaticDesigns.md",
"tutorials/StaticDesignsFiltration.md",
"tutorials/GenerativeDesigns.md",
Expand Down
2 changes: 1 addition & 1 deletion docs/src/tutorials/GenerativeDesigns.jl
Original file line number Diff line number Diff line change
Expand Up @@ -249,7 +249,7 @@ designs = efficient_designs(
6,
evidence;
solver,
mdp_options = (; max_parallel = 2, costs_tradeoff = [0, 1.0]),
mdp_options = (; max_parallel = 2, costs_tradeoff = (0, 1.0)),
repetitions = 5,
);

Expand Down
2 changes: 1 addition & 1 deletion docs/src/tutorials/GenerativeDesigns.md
Original file line number Diff line number Diff line change
Expand Up @@ -287,7 +287,7 @@ designs = efficient_designs(
6,
evidence;
solver,
mdp_options = (; max_parallel = 2, costs_tradeoff = [0, 1.0]),
mdp_options = (; max_parallel = 2, costs_tradeoff = (0, 1.0)),
repetitions = 5,
);
nothing #hide
Expand Down
140 changes: 140 additions & 0 deletions docs/src/tutorials/SimpleStatic.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,140 @@
# # Static Experimental Designs

# Consider the following scenario. There exists a set of experiments, each of which, when performed, yields
# measurements on one or more observables (features). Each subset of observables (and therefore each subset of experiments)
# has some "information value", which is intentionally vaguely defined for generality, but for example, may be
# a loss function if that subset is used to train some machine learning model. It is generally the value of acquiring that information.
# Finally, each experiment has some monetary cost and execution time to perform the experiment, and
# the user has some known tradeoff between overall execution time and cost.
#
# CEED.jl provides tools to take these inputs and produce a set of optimal "arrangements" of experiments for each
# subset of experiments that form a Pareto front along the tradeoff between information gain and total combined cost
# (monetary and time). This allows informed decisions to be made, for example, regarding how to allocate scarce
# resources to a set of experiments that attain some acceptable level of information (or, conversely, reduce
# uncertainty below some level).
#
# The arrangements produced by the tools introduced in this tutorial are called "static" because they implicitly
# assume that future data will have exactly the information gain of each experiment as the "historical" input.
#
# This tutorial introduces the theoretical framework behind static experimental designs with synthetic data.
# For examples using real data, please see our other tutorials.

# ## Theoretical Framework

# ### Experiments

# Let $E = \{ e_1, \ldots, e_n\}$ be a set of $n$ experiments (i.e., $|E|=n$). Each experiment $e \in E$ has an
# associated tuple $(m_{e},t_{e})$, giving the monetary cost and time duration required to perform experiment $e$.
#
# Consider $P(E)$, the power set of experiments (i.e., every possible subset of experiments). Each subset of
# experiments $S\in P(E)$ has an associated value $v_{S}$, which is the value of the experiments contained in $S$.
# This may be given by the loss function associated with a prediction task where the information yielded from $S$
# is used as predictor variables, or some other notion of information value.

# ### Arrangements

# If experiments within a subset $S$ can be performed simultaneously (in parallel), then each $S$ may be arranged
# optimally with respect to time. An arrangement $O_{S}$ of $S$ is a partition of the experiments in $S$ such that
# the size of each partition is not larger than the maximum number of experiments that may be done in parallel.
#
# Let $l$ be the number of partitions, and $o_{i}$ a partition in $O_{S}$. Then the arrangement is a surjection from $S$
# onto $O_{S}$. If no experiments can be done in parallel, then $l=|S|$. If all experiments are done in parallel, then
# $l=1$. Other arrangements fall between these extremes.

# ### Optimal Arrangements

# To find the optimal arrangement for each $S$ we need to know the cost of $O_{S}$. The monetary cost of $O_{S}$ is simply
# the sum of the costs of each experiment:
# $$m_{O_{S}}=\sum_{e\in S} m_{e}$$
# The total time required is the sum of the maximum time *of each partition*. This is because while each partition in the
# arrangement is done in serial, experiments within partitions are done in parallel.
# $$t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e} e \in o_{i}\}$$
# Given these costs and a parameter $\lambda$ which controls the tradeoff between monetary cost and time, the combined
# cost of an arrangement is:
# $$\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$$
#
# For instance, consider the experiments $S = \{e_{1},e_{2},e_{3},e_{4}\}$, with associated costs $(1, 1)$, $(1, 3)$, $(1, 2)$, and $(1, 4)$.
# If we conduct experiments $e_1$ through $e_4$ in sequence, this would correspond to an arrangement
# $O_{S} = (\{ e_1 \}, \{ e_2 \}, \{ e_3 \}, \{ e_4 \})$ with a total cost of $m_{O_{S}} = 4$ and $t_{O_{S}} = 10$.
#
# However, if we decide to conduct $e_1$ in parallel with $e_3$, and $e_2$ with $e_4$, we would obtain an arrangement
# $O_{S} = (\{ e_1, e_3 \}, \{ e_2, e_4 \})$ with a total cost of $m_{O_{S}} = 4$, and $t_{O_{S}} = 3 + 4 = 7$.
#
# Continuing our example and assuming a maximum of two parallel experiments, the optimal arrangement is to conduct
# $e_1$ in parallel with $e_2$, and $e_3$ with $e_4$. This results in an arrangement $O_{S} = (\{ e_1, e_2 \}, \{ e_3, e_4 \})$ with a total cost of $m_o = 4$ and $t_o = 2 + 4 = 6$.
#
# In fact, it can be readily demonstrated that the optimal arrangement can be found by ordering the experiments in
# S in descending order according to their execution times. Consequently, the experiments are grouped sequentially
# into partitions whose size equals the maximum number of parallel experiments, except possibly for the final set,
# if the maximum number of parallel experiments does not divide $S$ evenly.

# ## Synthetic Data Example

# First we load necessary packages.

using CEED, CEED.StaticDesigns
using Combinatorics: powerset
using DataFrames
using POMDPs, POMDPTools, MCTS

# This tutorial presents a synthetic example of using CEED to optimize static experimental design.
# We consider a situation where there are 3 experiments, and we draw a value of their "loss function"
# or "entropy" from the uniform distribution on the unit interval for each.
#
# For each $S\in P(E)$, we simulate the information value ($v_{S}$) of $S$ as the product of
# the values for each individual experiment.
# Therefore, because smaller values are better, any subset containing multiple experiments is guaranteed to be
# more "valuable" than any component experiment.

experiments = ["e1","e2","e3"];
experiments_val = Dict([e => rand() for e in experiments]);

experiments_evals = Dict(
map(Set.(collect(powerset(experiments)))) do s
if length(s) > 0
s => prod([experiments_val[i] for i in s])
else
return s => 1.0
end
end
);

# Better experiments are more costly, both in terms of time and monetary cost. We print
# the data frame showing each experiment and its associated costs.

experiments_costs = Dict(
sort(collect(keys(experiments_val)), by=k->experiments_val[k], rev=true) .=> tuple.(1:3,1:3)
);

DataFrame(
experiment=collect(keys(experiments_costs)),
time=getindex.(values(experiments_costs),1),
cost=getindex.(values(experiments_costs),2)
)

# We can plot the experiments ordered by their "loss function".

plot_evals(experiments_evals; f = x->sort(collect(keys(x)), by = k->x[k], rev=true), ylabel = "loss")

# We print the data frame showing each subset of experiments and its overall loss value.

DataFrame(
S=collect.(collect(keys(experiments_evals))),
value=collect(values(experiments_evals))
)

# Now we are ready to find the subsets of experiments giving an optimal tradeoff between information
# value and combined cost (where we use $\lambda=0.5$). CEED exports a function `efficient_designs`
# which formulates the problem of finding optimal arrangements as a Markov Decision Process and solves
# optimal arrangements for each subset on the Pareto frontier.
#
# Note that because we set the maximum number of parallel experiments equal to 2, the complete subset
# of experiments groups the experiments with long execution times together (see plot legend; each group/partition is
# prefixed with a number).

max_parallel = 2;
tradeoff = (0.5, 0.5);

designs = efficient_designs(experiments_costs, experiments_evals, max_parallel=max_parallel, tradeoff=tradeoff);

plot_front(designs; labels = make_labels(designs), ylabel = "loss")
158 changes: 158 additions & 0 deletions docs/src/tutorials/SimpleStatic.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,158 @@
```@meta
EditURL = "SimpleStatic.jl"
```

# Static Experimental Designs

Consider the following scenario. There exists a set of experiments, each of which, when performed, yields
measurements on one or more observables (features). Each subset of observables (and therefore each subset of experiments)
has some "information value", which is intentionally vaguely defined for generality, but for example, may be
a loss function if that subset is used to train some machine learning model. It is generally the value of acquiring that information.
Finally, each experiment has some monetary cost and execution time to perform the experiment, and
the user has some known tradeoff between overall execution time and cost.

CEED.jl provides tools to take these inputs and produce a set of optimal "arrangements" of experiments for each
subset of experiments that form a Pareto front along the tradeoff between information gain and total combined cost
(monetary and time). This allows informed decisions to be made, for example, regarding how to allocate scarce
resources to a set of experiments that attain some acceptable level of information (or, conversely, reduce
uncertainty below some level).

The arrangements produced by the tools introduced in this tutorial are called "static" because they implicitly
assume that future data will have exactly the information gain of each experiment as the "historical" input.

This tutorial introduces the theoretical framework behind static experimental designs with synthetic data.
For examples using real data, please see our other tutorials.

## Theoretical Framework

### Experiments

Let $E = \{ e_1, \ldots, e_n\}$ be a set of $n$ experiments (i.e., $|E|=n$). Each experiment $e \in E$ has an
associated tuple $(m_{e},t_{e})$, giving the monetary cost and time duration required to perform experiment $e$.

Consider $P(E)$, the power set of experiments (i.e., every possible subset of experiments). Each subset of
experiments $S\in P(E)$ has an associated value $v_{S}$, which is the value of the experiments contained in $S$.
This may be given by the loss function associated with a prediction task where the information yielded from $S$
is used as predictor variables, or some other notion of information value.

### Arrangements

If experiments within a subset $S$ can be performed simultaneously (in parallel), then each $S$ may be arranged
optimally with respect to time. An arrangement $O_{S}$ of $S$ is a partition of the experiments in $S$ such that
the size of each partition is not larger than the maximum number of experiments that may be done in parallel.

Let $l$ be the number of partitions, and $o_{i}$ a partition in $O_{S}$. Then the arrangement is a surjection from $S$
onto $O_{S}$. If no experiments can be done in parallel, then $l=|S|$. If all experiments are done in parallel, then
$l=1$. Other arrangements fall between these extremes.

### Optimal Arrangements

To find the optimal arrangement for each $S$ we need to know the cost of $O_{S}$. The monetary cost of $O_{S}$ is simply
the sum of the costs of each experiment:
$$m_{O_{S}}=\sum_{e\in S} m_{e}$$
The total time required is the sum of the maximum time *of each partition*. This is because while each partition in the
arrangement is done in serial, experiments within partitions are done in parallel.
$$t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e} e \in o_{i}\}$$
Given these costs and a parameter $\lambda$ which controls the tradeoff between monetary cost and time, the combined
cost of an arrangement is:
$$\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$$

For instance, consider the experiments $S = \{e_{1},e_{2},e_{3},e_{4}\}$, with associated costs $(1, 1)$, $(1, 3)$, $(1, 2)$, and $(1, 4)$.
If we conduct experiments $e_1$ through $e_4$ in sequence, this would correspond to an arrangement
$O_{S} = (\{ e_1 \}, \{ e_2 \}, \{ e_3 \}, \{ e_4 \})$ with a total cost of $m_{O_{S}} = 4$ and $t_{O_{S}} = 10$.

However, if we decide to conduct $e_1$ in parallel with $e_3$, and $e_2$ with $e_4$, we would obtain an arrangement
$O_{S} = (\{ e_1, e_3 \}, \{ e_2, e_4 \})$ with a total cost of $m_{O_{S}} = 4$, and $t_{O_{S}} = 3 + 4 = 7$.

Continuing our example and assuming a maximum of two parallel experiments, the optimal arrangement is to conduct
$e_1$ in parallel with $e_2$, and $e_3$ with $e_4$. This results in an arrangement $O_{S} = (\{ e_1, e_2 \}, \{ e_3, e_4 \})$ with a total cost of $m_o = 4$ and $t_o = 2 + 4 = 6$.

In fact, it can be readily demonstrated that the optimal arrangement can be found by ordering the experiments in
S in descending order according to their execution times. Consequently, the experiments are grouped sequentially
into partitions whose size equals the maximum number of parallel experiments, except possibly for the final set,
if the maximum number of parallel experiments does not divide $S$ evenly.

## Synthetic Data Example

First we load necessary packages.

````@example SimpleStatic
using CEED, CEED.StaticDesigns
using Combinatorics: powerset
using DataFrames
using POMDPs, POMDPTools, MCTS
````

This tutorial presents a synthetic example of using CEED to optimize static experimental design.
We consider a situation where there are 3 experiments, and we draw a value of their "loss function"
or "entropy" from the uniform distribution on the unit interval for each.

For each $S\in P(E)$, we simulate the information value ($v_{S}$) of $S$ as the product of
the values for each individual experiment.
Therefore, because smaller values are better, any subset containing multiple experiments is guaranteed to be
more "valuable" than any component experiment.

````@example SimpleStatic
experiments = ["e1","e2","e3"];
experiments_val = Dict([e => rand() for e in experiments]);
experiments_evals = Dict(
map(Set.(collect(powerset(experiments)))) do s
if length(s) > 0
s => prod([experiments_val[i] for i in s])
else
return s => 1.0
end
end
);
nothing #hide
````

Better experiments are more costly, both in terms of time and monetary cost. We print
the data frame showing each experiment and its associated costs.

````@example SimpleStatic
experiments_costs = Dict(
sort(collect(keys(experiments_val)), by=k->experiments_val[k], rev=true) .=> tuple.(1:3,1:3)
);
DataFrame(
experiment=collect(keys(experiments_costs)),
time=getindex.(values(experiments_costs),1),
cost=getindex.(values(experiments_costs),2)
)
````

We can plot the experiments ordered by their "loss function".

````@example SimpleStatic
plot_evals(experiments_evals; f = x->sort(collect(keys(x)), by = k->x[k], rev=true), ylabel = "loss")
````

We print the data frame showing each subset of experiments and its overall loss value.

````@example SimpleStatic
DataFrame(
S=collect.(collect(keys(experiments_evals))),
value=collect(values(experiments_evals))
)
````

Now we are ready to find the subsets of experiments giving an optimal tradeoff between information
value and combined cost (where we use $\lambda=0.5$). CEED exports a function `efficient_designs`
which formulates the problem of finding optimal arrangements as a Markov Decision Process and solves
optimal arrangements for each subset on the Pareto frontier.

Note that because we set the maximum number of parallel experiments equal to 2, the complete subset
of experiments groups the experiments with long execution times together (see plot legend; each group/partition is
prefixed with a number).

````@example SimpleStatic
max_parallel = 2;
tradeoff = (0.5, 0.5);
designs = efficient_designs(experiments_costs, experiments_evals, max_parallel=max_parallel, tradeoff=tradeoff);
plot_front(designs; labels = make_labels(designs), ylabel = "loss")
````

Loading

0 comments on commit a522106

Please sign in to comment.