Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Performance improvements for zarr backend #1800

Merged
merged 38 commits into from
Jan 24, 2018
Merged
Changes from 1 commit
Commits
Show all changes
38 commits
Select commit Hold shift + click to select a range
a0bea98
move backend append logic to the prepare_variable methods
Dec 21, 2017
afdb254
deprecate variables/dimensions/attrs properties on AbstractWritableDa…
Dec 22, 2017
cc02150
warnings instead of errors for backend properties
Dec 24, 2017
86240cd
use attrs.update when setting zarr attributes
Dec 26, 2017
9c89ef2
more performance improvements to attributes in zarr backend
Dec 26, 2017
2568d21
Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a…
Dec 28, 2017
d459c66
fix typo
Dec 28, 2017
c59ca57
Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a…
Dec 28, 2017
2dd186a
Merge branch 'fix/zarr_set_attrs' of github.com:jhamman/xarray into f…
Dec 28, 2017
8f71b31
new set_dimensions method for writable data stores
Jan 2, 2018
07b9c21
Merge branch 'fix/zarr_set_attrs' of github.com:jhamman/xarray into f…
Jan 2, 2018
67fcd92
more fixes for zarr
Jan 2, 2018
b38e1a6
more tests for zarr and remove append logic for zarr
Jan 2, 2018
47ba8b6
more tests for zarr and remove append logic for zarr
Jan 2, 2018
9152b12
Merge branch 'fix/zarr_set_attrs' of github.com:jhamman/xarray into f…
Jan 2, 2018
26b6bcb
a few more tweaks to zarr attrs
Jan 2, 2018
b7681ae
Add encode methods to writable data stores, fixes for Zarr tests
Jan 4, 2018
e084e9e
fix for InMemoryDataStore
Jan 5, 2018
a6aeb36
fix for unlimited dimensions Scipy Datastores
Jan 5, 2018
264b13f
another patch for scipy
Jan 5, 2018
9c03bfc
whatsnew
Jan 6, 2018
c92020a
ordereddict
Jan 7, 2018
18434f9
address some of rabernats comments, in particular, this commit remove…
Jan 9, 2018
9f89c7c
stop skipping zero-dim zarr tests
Jan 9, 2018
3590d28
update minimum zarr version for tests
Jan 9, 2018
69cacee
Merge branch 'master' into fix/zarr_set_attrs
Jan 9, 2018
8d744e0
Merge branch 'master' into fix/zarr_set_attrs
Jan 9, 2018
a8dabdf
Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a…
Jan 10, 2018
7858db7
Merge branch 'fix/zarr_set_attrs' of github.com:jhamman/xarray into f…
Jan 10, 2018
48bf7ef
Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a…
Jan 11, 2018
53260c9
Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a…
Jan 12, 2018
7ed6bf8
cleanup and docs for zarr performance branch
Jan 13, 2018
3872da2
fix two failing tests when using zarr master
Jan 14, 2018
e6b7068
Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a…
Jan 14, 2018
c31decf
flake8
Jan 15, 2018
189d262
back to zarr 2.2
Jan 15, 2018
07b92e2
Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a…
Jan 15, 2018
96996ef
remove extra store method
Jan 16, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 1 addition & 5 deletions xarray/tests/test_backends.py
Original file line number Diff line number Diff line change
Expand Up @@ -1249,7 +1249,7 @@ def test_write_persistence_modes(self):
self.save(original, store, mode='w')
with self.open(store) as actual:
self.assertDatasetIdentical(original, actual)
with pytest.raises(KeyError):
with pytest.raises(ValueError):
self.save(original, store, mode='w-')

# check that we can't use other persistence modes
Expand All @@ -1273,10 +1273,6 @@ def test_group(self):
with self.roundtrip(original, save_kwargs={'group': group},
open_kwargs={'group': group}) as actual:
self.assertDatasetIdentical(original, actual)
with pytest.raises(KeyError):
with self.roundtrip(original,
save_kwargs={'group': group}) as actual:
self.assertDatasetIdentical(original, actual)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove this part of the test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting one. I wanted to see if the tests all passed after this change before bringing this up for discussion.

The key error that is raised here was being raised when looking for the _DIMENSIONS_KEY on the group. Now that each group doesn't have that key, it obviously isn't raised. So the question up for discussion is what should we do when we open a zarr group that doesn't have anything in it? Currently, an empty group is returned as an empty Dataset.

Copy link
Contributor

@rabernat rabernat Jan 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is related to my earlier comment. I think we should pick a behavior we want to support. Either

  • You can open any zarr group with xarray
  • You can only open zarr groups that themselves were created by xarray

The current docs say the latter. However, having eliminated _DIMENSION_KEY from the group attributes, there is no way to check for whether the group was created by xarray if there are no variables.

I can't imagine any scenario in which a user would want to open an empty zarr group in read-only mode. That would be useless. I imagine that this means the dataset was written with a group path but then the user forgot to specify a path when reading. So let's raise an error for now if the group is empty.

In a future PR, we can figure out how to more comprehensively support the reading of zarr datasets that do not have xarray's special attributes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to play devils advocate for a moment, The current behavior is consistent with how we treat groups in netCDF4. Consider the following example:

# create a dataset with a group and nothing in the root group
In [16]: import netCDF4 as nc4

In [17]: rg = nc4.Dataset('/Users/jhamman/workdir/junk.nc', 'w')

In [18]: g1 = rg.createGroup('some/random/path')

In [19]: g1
Out[19]:
<class 'netCDF4._netCDF4.Group'>
group /some/random/path:
    dimensions(sizes):
    variables(dimensions):
    groups:

In [20]: g1.createDimension('time', None)
Out[20]: <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'time', size = 0

In [22]: time = g1.createVariable('time', 'f8', ('time', ))

In [23]: time[:] = 30

In [25]: rg
Out[25]:
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    dimensions(sizes):
    variables(dimensions):
    groups: some

In [26]: rg.close()

# opening the dataset in xarray returns an empty dataset
In [27]: xr.open_dataset('/Users/jhamman/workdir/junk.nc')
Out[27]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*

# but if you specify the group, you get the expected dataset
In [28]: xr.open_dataset('/Users/jhamman/workdir/junk.nc', group='some/random/path')
Out[28]:
<xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) float64 30.0
Data variables:
    *empty*

If we were to raise an error in the Zarr backend, would we want to check for the presence of variables or attributes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that's a pretty good point.

I'm fine with leaving as is.


# TODO: implement zarr object encoding and make these tests pass
@pytest.mark.xfail(reason="Zarr object encoding not implemented")
Expand Down