Implement DataTree as data structure #537
Open: veni-vidi-vici-dormivi wants to merge 49 commits into MESMER-group:main from veni-vidi-vici-dormivi:trees
Commits (49):
e1a2dc4  add datatree to dependencies
0fec170  add collapse datatree util
23b7a51  switch utils in init
4d253c9  nit
8fc314f  implement autoregression
de4850f  test new autoregression functionality
c80869d  implement scenario weights func
ce198ac  adapt collapse datatree
d99973e  linting
397cef6  add fielfinder dependency
bb81f54  change and add integration tests (not consistent)
353d561  adapt old codepath autoregression_scen_ens
025fba4  linting
83cf397  Merge branch 'main' into trees
109cf92  update dependencies filefinder
be4c96c  implement and deprecate list in autoregression_scen_ens
3c51385  fix autoregression call in test
66e15f0  downgrade datatree and add pip dep
5a29072  add activating env
147fdc7  remove activate again
c7a7f1c  Merge branch 'main' into trees
356cfa9  try verifying micromamba path
24a5691  forgot to add it
fc10078  revert changes in ci-workflow
0286046  expand weighting function
452e0b6  add todo
6943c62  expand collapse_dt tests
93b5b4c  refine collapse datatree
4e6063c  linting in weighted
876ffcb  add datatree to arraydict
03f0083  implement seed dict in autoregression
0137af4  adapt linear regression
156ede7  adapt volc
df8e814  init
e791502  linting
d37cbdd  add tas**2 test
5b8751f  add hfds tests
046347c  test stack_linear_regression_data
7f38ce3  nit
dd31eed  Update mesmer/core/weighted.py
18d8b07  Update mesmer/core/utils.py
a62c6a4  Update mesmer/core/utils.py
f13734c  Update mesmer/core/utils.py
b5be5fd  broadcast~ed~
5223e26  outsurce datatree utils
1f5be62  renaming
c6aee41  get rid of dt to arraydict
422347e  fixes
94c2e42  linting
@@ -0,0 +1,161 @@
import xarray as xr
from datatree import DataTree


def _extract_single_dataarray_from_dt(dt: DataTree) -> xr.DataArray:
    """
    Extract a single DataArray from a DataTree node holding one ``Dataset`` with one ``DataArray``.
    """
    # assert there is only one node in dt
    if not len(list(dt.subtree)) == 1:
        raise ValueError("DataTree must only contain one node.")
    if not dt.has_data:
        raise ValueError("DataTree must contain data.")

    ds = dt.to_dataset()
    if len(ds.data_vars) != 1:
        raise ValueError("DataTree must have exactly one data variable.")

    varname = list(ds.data_vars)[0]
    da = ds.to_array().isel(variable=0).drop_vars("variable")
    return da.rename(varname)

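The single-variable check above can be illustrated without xarray. The following is a minimal stdlib sketch in which a plain dict of variable name to values stands in for a one-node Dataset; the name `extract_single_array` is hypothetical and only mirrors the validation logic of `_extract_single_dataarray_from_dt`:

```python
def extract_single_array(node: dict) -> tuple[str, list]:
    """Return (name, values) if ``node`` holds exactly one variable."""
    if not node:
        # mirrors "DataTree must contain data."
        raise ValueError("node must contain data")
    if len(node) != 1:
        # mirrors "DataTree must have exactly one data variable."
        raise ValueError("node must have exactly one data variable")
    ((name, values),) = node.items()
    return name, values


name, values = extract_single_array({"tas": [1.0, 2.0]})
print(name, values)  # tas [1.0, 2.0]
```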
def collapse_datatree_into_dataset(dt: DataTree, dim: str) -> xr.Dataset:
    """
    Take a ``DataTree`` and collapse **all subtrees** into a single ``xr.Dataset`` along ``dim``.
    All datasets in the ``DataTree`` must have the same dimensions, and each dimension must
    have a coordinate.

    Parameters
    ----------
    dt : DataTree
        The DataTree to collapse.
    dim : str
        The dimension to concatenate the datasets along.

    Returns
    -------
    xr.Dataset
        The collapsed dataset.

    Raises
    ------
    ValueError
        If the datasets do not all have the same dimensions.
        If any dimension does not have a coordinate.
    """
    # TODO: could potentially be replaced by DataTree.merge_child_nodes in the future?
    datasets = [subtree.to_dataset() for subtree in dt.subtree if not subtree.is_empty]

    # check that all datasets have the same dimensions
    first_dims = set(datasets[0].dims)
    if not all(set(ds.dims) == first_dims for ds in datasets):
        raise ValueError("All datasets must have the same dimensions")

    # check that all dimensions have coordinates
    for ds in datasets:
        for ds_dim in ds.dims:
            if ds[ds_dim].coords == {}:
                raise ValueError(
                    f"Dimension '{ds_dim}' must have a coordinate/coordinates."
                )

    # concatenate the datasets along the specified dimension
    ds = xr.concat(datasets, dim=dim)
    ds = ds.assign_coords(
        {dim: [subtree.name for subtree in dt.subtree if not subtree.is_empty]}
    )

    return ds

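The collapse operation (concatenate all non-empty leaves along a new dimension whose coordinate is built from the node names) can be sketched with plain dicts. The name `collapse_tree` and the list-of-lists representation are assumptions for illustration only, not the xarray-backed implementation:

```python
def collapse_tree(tree: dict[str, list]) -> dict:
    """Concatenate per-node lists along a new 'scenario' axis keyed by node name."""
    lengths = {len(values) for values in tree.values()}
    if len(lengths) != 1:
        # mirrors "All datasets must have the same dimensions"
        raise ValueError("All datasets must have the same dimensions")
    return {
        "scenario": list(tree),           # new coordinate from the node names
        "data": [tree[k] for k in tree],  # leaves stacked along the new dim
    }


out = collapse_tree({"ssp126": [1, 2], "ssp585": [3, 4]})
print(out)  # {'scenario': ['ssp126', 'ssp585'], 'data': [[1, 2], [3, 4]]}
```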
def stack_linear_regression_datatrees(
    predictors: DataTree,
    target: DataTree,
    weights: DataTree | None,
    *,
    stacking_dims: list[str],
    collapse_dim: str = "scenario",
    stacked_dim: str = "sample",
) -> tuple[DataTree, xr.Dataset, xr.Dataset | None]:
    """
    Prepares data for linear regression:
    1. Broadcasts predictors to target
    2. Collapses DataTrees into Datasets
    3. Stacks the Datasets along the stacking dimension(s)

    Parameters
    ----------
    predictors : DataTree
        A ``DataTree`` of ``xr.Dataset`` objects used as predictors. The ``DataTree``
        must have a subtree for each predictor, each of which has to have at least one
        leaf holding a ``xr.Dataset`` representing a scenario. The subtrees of
        different predictors must be isomorphic (i.e. have the same scenarios). Each
        ``xr.Dataset`` must at least contain the ``stacking_dims`` and must only hold
        one data variable.
    target : DataTree
        A ``DataTree`` holding the targets. Must be isomorphic to the predictor
        subtrees, i.e. have the same scenarios. Each leaf must hold a ``xr.Dataset``
        which must be at least 2D and contain the ``stacking_dims``, but may also
        contain a dimension for ensemble members.
    weights : DataTree or None
        Individual weights for each sample, must be isomorphic to target. Must at
        least contain the ``stacking_dims``, and must have the ensemble member
        dimension if target has it.
    stacking_dims : list[str]
        Dimension(s) to stack.
    collapse_dim : str, default: "scenario"
        Dimension along which to collapse the DataTrees; will automatically be added
        to the stacking dims.
    stacked_dim : str, default: "sample"
        Name of the stacked dimension.

    Returns
    -------
    tuple
        Tuple of the prepared predictors, target and weights, where the predictors and
        target are stacked along the stacking dimensions and the weights are stacked
        along the stacking dimensions and the ensemble member dimension.

    Notes
    -----
    Dimensions which exist along the target but are not in the stacking_dims will be
    excluded from the broadcasting of the predictors.
    """

    stacking_dims_all = stacking_dims + [collapse_dim]

    # exclude target dimensions from broadcasting which are not in the stacking_dims
    exclude_dim = set(target.leaves[0].ds.dims) - set(stacking_dims)

    # the predictors need to be
    predictors_stacked = DataTree()
    for key, subtree in predictors.items():
        # 1) broadcast to the target
        pred_broadcast = subtree.broadcast_like(target, exclude=exclude_dim)
        # 2) collapsed into a Dataset
        predictor_ds = collapse_datatree_into_dataset(pred_broadcast, dim=collapse_dim)
        # 3) stacked
        predictors_stacked[key] = DataTree(
            predictor_ds.stack(
                {stacked_dim: stacking_dims_all}, create_index=False
            ).dropna(dim=stacked_dim)
        )

    # the target needs to be
    # 1) collapsed into a Dataset
    target_ds = collapse_datatree_into_dataset(target, dim=collapse_dim)
    # 2) stacked
    target_stacked = target_ds.stack(
        {stacked_dim: stacking_dims_all}, create_index=False
    ).dropna(dim=stacked_dim)

    # the weights need to be
    if weights is not None:
        # 1) collapsed into a Dataset
        weights_ds = collapse_datatree_into_dataset(weights, dim=collapse_dim)
        # 2) stacked
        weights_stacked = weights_ds.stack(
            {stacked_dim: stacking_dims_all}, create_index=False
        ).dropna(dim=stacked_dim)
    else:
        weights_stacked = None

    return predictors_stacked, target_stacked, weights_stacked
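The collapse-then-stack step (flatten the scenario dimension together with, e.g., time into one sample axis, dropping missing values) can be sketched in plain Python. The helper name `stack_samples` and the tuple representation are assumptions for illustration; `None` plays the role of NaN here so the filter mirrors ``dropna(dim=stacked_dim)``:

```python
def stack_samples(tree: dict[str, list]) -> list[tuple[str, int, float]]:
    """Flatten per-scenario series into (scenario, time, value) samples."""
    samples = []
    for scenario, series in tree.items():
        for t, value in enumerate(series):
            if value is not None:  # mirrors dropna on the stacked dimension
                samples.append((scenario, t, value))
    return samples


print(stack_samples({"ssp126": [1.0, None], "ssp585": [2.0, 3.0]}))
# [('ssp126', 0, 1.0), ('ssp585', 0, 2.0), ('ssp585', 1, 3.0)]
```

Each resulting sample keeps its (scenario, time) origin, which is what the multi-dimension `stack` over ``stacking_dims + [collapse_dim]`` achieves for the real datasets.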
These typing fixes are good - you could extract them (but it's also OK to keep them here if that's too annoying).
Just FYI - here is the list of which 'protocol' needs to have which methods: https://docs.python.org/3/library/collections.abc.html
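For reference, the `collections.abc` classes linked above each declare the methods a type must provide, and several of them (e.g. `Iterable` and `Sized`) recognize conforming classes structurally, without explicit subclassing. A small sketch (the class `Bag` is hypothetical):

```python
from collections.abc import Iterable, Sized


class Bag:
    """Provides __iter__ and __len__, so it counts as Iterable and Sized."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


b = Bag([1, 2, 3])
print(isinstance(b, Iterable), isinstance(b, Sized))  # True True
```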