-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Performance improvements for zarr backend #1800
Changes from 33 commits
a0bea98
afdb254
cc02150
86240cd
9c89ef2
2568d21
d459c66
c59ca57
2dd186a
8f71b31
07b9c21
67fcd92
b38e1a6
47ba8b6
9152b12
26b6bcb
b7681ae
e084e9e
a6aeb36
264b13f
9c03bfc
c92020a
18434f9
9f89c7c
3590d28
69cacee
8d744e0
a8dabdf
7858db7
48bf7ef
53260c9
7ed6bf8
3872da2
e6b7068
c31decf
189d262
07b92e2
96996ef
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,7 +6,7 @@ | |
import time | ||
import traceback | ||
import contextlib | ||
from collections import Mapping | ||
from collections import Mapping, OrderedDict | ||
import warnings | ||
|
||
from ..conventions import cf_encoder | ||
|
@@ -96,6 +96,9 @@ def __getitem__(self, key): | |
def __len__(self): | ||
return len(self.variables) | ||
|
||
def get_dimensions(self): # pragma: no cover | ||
raise NotImplementedError | ||
|
||
def get_attrs(self): # pragma: no cover | ||
raise NotImplementedError | ||
|
||
|
@@ -195,6 +198,37 @@ def __init__(self, writer=None): | |
writer = ArrayWriter() | ||
self.writer = writer | ||
|
||
def encode(self, variables, attributes): | ||
""" | ||
Encode the variables and attributes in this store | ||
|
||
Parameters | ||
---------- | ||
variables : dict-like | ||
Dictionary of key/value (variable name / xr.Variable) pairs | ||
attributes : dict-like | ||
Dictionary of key/value (attribute name / attribute) pairs | ||
|
||
Returns | ||
------- | ||
variables : dict-like | ||
attributes : dict-like | ||
|
||
""" | ||
variables = OrderedDict([(k, self.encode_variable(v)) | ||
for k, v in variables.items()]) | ||
attributes = OrderedDict([(k, self.encode_attribute(v)) | ||
for k, v in attributes.items()]) | ||
return variables, attributes | ||
|
||
def encode_variable(self, v): | ||
"""encode one variable""" | ||
return v | ||
|
||
def encode_attribute(self, a): | ||
"""encode one attribute""" | ||
return a | ||
|
||
def set_dimension(self, d, l): # pragma: no cover | ||
raise NotImplementedError | ||
|
||
|
@@ -208,24 +242,74 @@ def sync(self): | |
self.writer.sync() | ||
|
||
def store_dataset(self, dataset): | ||
# in stores variables are all variables AND coordinates | ||
# in xarray.Dataset variables are variables NOT coordinates, | ||
# so here we pass the whole dataset in instead of doing | ||
# dataset.variables | ||
""" | ||
in stores, variables are all variables AND coordinates | ||
in xarray.Dataset variables are variables NOT coordinates, | ||
so here we pass the whole dataset in instead of doing | ||
dataset.variables | ||
""" | ||
self.store(dataset, dataset.attrs) | ||
|
||
def store(self, variables, attributes, check_encoding_set=frozenset(), | ||
unlimited_dims=None): | ||
""" | ||
Top level method for putting data on this store, this method: | ||
- encodes variables/attributes | ||
- sets dimensions | ||
- sets variables | ||
|
||
Parameters | ||
---------- | ||
variables : dict-like | ||
Dictionary of key/value (variable name / xr.Variable) pairs | ||
attributes : dict-like | ||
Dictionary of key/value (attribute name / attribute) pairs | ||
check_encoding_set : list-like | ||
List of variables that should be checked for invalid encoding | ||
values | ||
unlimited_dims : list-like | ||
List of dimension names that should be treated as unlimited | ||
dimensions. | ||
""" | ||
|
||
variables, attributes = self.encode(variables, attributes) | ||
|
||
self.set_attributes(attributes) | ||
self.set_dimensions(variables, unlimited_dims=unlimited_dims) | ||
self.set_variables(variables, check_encoding_set, | ||
unlimited_dims=unlimited_dims) | ||
|
||
def set_attributes(self, attributes): | ||
""" | ||
This provides a centralized method to set the dataset attributes on the | ||
data store. | ||
|
||
Parameters | ||
---------- | ||
attributes : dict-like | ||
Dictionary of key/value (attribute name / attribute) pairs | ||
""" | ||
for k, v in iteritems(attributes): | ||
self.set_attribute(k, v) | ||
|
||
def set_variables(self, variables, check_encoding_set, | ||
unlimited_dims=None): | ||
""" | ||
This provides a centralized method to set the variables on the data | ||
store. | ||
|
||
Parameters | ||
---------- | ||
variables : dict-like | ||
Dictionary of key/value (variable name / xr.Variable) pairs | ||
check_encoding_set : list-like | ||
List of variables that should be checked for invalid encoding | ||
values | ||
unlimited_dims : list-like | ||
List of dimension names that should be treated as unlimited | ||
dimensions. | ||
""" | ||
|
||
for vn, v in iteritems(variables): | ||
name = _encode_variable_name(vn) | ||
check = vn in check_encoding_set | ||
|
@@ -234,23 +318,54 @@ def set_variables(self, variables, check_encoding_set, | |
|
||
self.writer.add(source, target) | ||
|
||
def set_necessary_dimensions(self, variable, unlimited_dims=None): | ||
def set_dimensions(self, variables, unlimited_dims=None): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add a docstring or comment explaining what this method does? This would help new developers (including myself) come onboard with backend development. |
||
""" | ||
This provides a centralized method to set the dimensions on the data | ||
store. | ||
|
||
Parameters | ||
---------- | ||
variables : dict-like | ||
Dictionary of key/value (variable name / xr.Variable) pairs | ||
unlimited_dims : list-like | ||
List of dimension names that should be treated as unlimited | ||
dimensions. | ||
""" | ||
if unlimited_dims is None: | ||
unlimited_dims = set() | ||
dims = self.get_dimensions() | ||
for d, l in zip(variable.dims, variable.shape): | ||
if d not in dims: | ||
|
||
existing_dims = self.get_dimensions() | ||
|
||
dims = OrderedDict() | ||
for v in unlimited_dims: # put unlimited_dims first | ||
dims[v] = None | ||
for v in variables.values(): | ||
dims.update(dict(zip(v.dims, v.shape))) | ||
|
||
for d, l in dims.items(): | ||
|
||
if d in existing_dims and l != existing_dims[d]: | ||
raise ValueError("Unable to update size for existing dimension" | ||
"%r (%d != %d)" % (d, l, existing_dims[d])) | ||
elif d not in existing_dims: | ||
is_unlimited = d in unlimited_dims | ||
self.set_dimension(d, l, is_unlimited) | ||
|
||
|
||
class WritableCFDataStore(AbstractWritableDataStore): | ||
|
||
def store(self, variables, attributes, *args, **kwargs): | ||
def encode(self, variables, attributes): | ||
# All NetCDF files get CF encoded by default, without this attempting | ||
# to write times, for example, would fail. | ||
cf_variables, cf_attrs = cf_encoder(variables, attributes) | ||
AbstractWritableDataStore.store(self, cf_variables, cf_attrs, | ||
variables, attributes = cf_encoder(variables, attributes) | ||
variables = OrderedDict([(k, self.encode_variable(v)) | ||
for k, v in variables.items()]) | ||
attributes = OrderedDict([(k, self.encode_attribute(v)) | ||
for k, v in attributes.items()]) | ||
return variables, attributes | ||
|
||
def store(self, variables, attributes, *args, **kwargs): | ||
AbstractWritableDataStore.store(self, variables, attributes, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can remove this method -- the implementation is exactly the parent class method. |
||
*args, **kwargs) | ||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely cleaner than what we had before, but I am reluctant to give the idea that this this is a new supported method for third-party DataStore classes. Maybe we can call this
_encode
for now, or add a warning about implementing it?Eventually, I would like to remove all implementations from the DataStore base classes, and leave them as purely abstract. This will make it clearer to new backend implementers what they actually should/can implement.
So instead of implementing an
encode()
method, data store classes could have a list of defaultencoders
(see xarray.coding) used when reading/writing data. But xarray.coding isn't quite ready for this yet...There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clearly some refactoring will be needed to this once the overall backend refactoring moves forward. For now, however, this seems like a reasonable compromise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shoyer - I'm a bit confused here. As you'll see in the Zarr backend, the
encode_variable
method is applying a list of encoders. Where in theWritableDataStore
were you envisioning the application of the encoders?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shoyer - What would you say to merging this in its current state and leaving the encoders refactor to a separate PR? I'm happy to make more changes here but a) I'm not sure how to address your last comment, and b) I've already drifted a fair ways off track with this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with that. We may want to change it more in the future but this is a clear improvement for now.