
WIP: Performance improvements for zarr backend #1800

Merged Jan 24, 2018 (38 commits)

Commits:
a0bea98  move backend append logic to the prepare_variable methods (Dec 21, 2017)
afdb254  deprecate variables/dimensions/attrs properties on AbstractWritableDa… (Dec 22, 2017)
cc02150  warnings instead of errors for backend properties (Dec 24, 2017)
86240cd  use attrs.update when setting zarr attributes (Dec 26, 2017)
9c89ef2  more performance improvements to attributes in zarr backend (Dec 26, 2017)
2568d21  Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a… (Dec 28, 2017)
d459c66  fix typo (Dec 28, 2017)
c59ca57  Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a… (Dec 28, 2017)
2dd186a  Merge branch 'fix/zarr_set_attrs' of github.com:jhamman/xarray into f… (Dec 28, 2017)
8f71b31  new set_dimensions method for writable data stores (Jan 2, 2018)
07b9c21  Merge branch 'fix/zarr_set_attrs' of github.com:jhamman/xarray into f… (Jan 2, 2018)
67fcd92  more fixes for zarr (Jan 2, 2018)
b38e1a6  more tests for zarr and remove append logic for zarr (Jan 2, 2018)
47ba8b6  more tests for zarr and remove append logic for zarr (Jan 2, 2018)
9152b12  Merge branch 'fix/zarr_set_attrs' of github.com:jhamman/xarray into f… (Jan 2, 2018)
26b6bcb  a few more tweaks to zarr attrs (Jan 2, 2018)
b7681ae  Add encode methods to writable data stores, fixes for Zarr tests (Jan 4, 2018)
e084e9e  fix for InMemoryDataStore (Jan 5, 2018)
a6aeb36  fix for unlimited dimensions Scipy Datastores (Jan 5, 2018)
264b13f  another patch for scipy (Jan 5, 2018)
9c03bfc  whatsnew (Jan 6, 2018)
c92020a  ordereddict (Jan 7, 2018)
18434f9  address some of rabernats comments, in particular, this commit remove… (Jan 9, 2018)
9f89c7c  stop skipping zero-dim zarr tests (Jan 9, 2018)
3590d28  update minimum zarr version for tests (Jan 9, 2018)
69cacee  Merge branch 'master' into fix/zarr_set_attrs (Jan 9, 2018)
8d744e0  Merge branch 'master' into fix/zarr_set_attrs (Jan 9, 2018)
a8dabdf  Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a… (Jan 10, 2018)
7858db7  Merge branch 'fix/zarr_set_attrs' of github.com:jhamman/xarray into f… (Jan 10, 2018)
48bf7ef  Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a… (Jan 11, 2018)
53260c9  Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a… (Jan 12, 2018)
7ed6bf8  cleanup and docs for zarr performance branch (Jan 13, 2018)
3872da2  fix two failing tests when using zarr master (Jan 14, 2018)
e6b7068  Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a… (Jan 14, 2018)
c31decf  flake8 (Jan 15, 2018)
189d262  back to zarr 2.2 (Jan 15, 2018)
07b92e2  Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a… (Jan 15, 2018)
96996ef  remove extra store method (Jan 16, 2018)
3 changes: 3 additions & 0 deletions doc/whats-new.rst
@@ -38,6 +38,9 @@ Enhancements
- Use ``pandas.Grouper`` class in xarray resample methods rather than the
deprecated ``pandas.TimeGrouper`` class (:issue:`1766`).
By `Joe Hamman <https://github.com/jhamman>`_.
- Support for using `Zarr`_ as storage layer for xarray. (:issue:`1223`).
By `Ryan Abernathey <https://github.com/rabernat>`_ and
`Joe Hamman <https://github.com/jhamman>`_.
- Support for using `Zarr`_ as storage layer for xarray.
By `Ryan Abernathey <https://github.com/rabernat>`_.
- :func:`xarray.plot.imshow` now handles RGB and RGBA images.
139 changes: 127 additions & 12 deletions xarray/backends/common.py
@@ -6,7 +6,7 @@
import time
import traceback
import contextlib
from collections import Mapping
from collections import Mapping, OrderedDict
import warnings

from ..conventions import cf_encoder
@@ -96,6 +96,9 @@ def __getitem__(self, key):
def __len__(self):
return len(self.variables)

def get_dimensions(self): # pragma: no cover
raise NotImplementedError

def get_attrs(self): # pragma: no cover
raise NotImplementedError

@@ -195,6 +198,37 @@ def __init__(self, writer=None):
writer = ArrayWriter()
self.writer = writer

def encode(self, variables, attributes):
Member:
This is definitely cleaner than what we had before, but I am reluctant to give the idea that this is a new supported method for third-party DataStore classes. Maybe we can call this _encode for now, or add a warning about implementing it?

Eventually, I would like to remove all implementations from the DataStore base classes, and leave them as purely abstract. This will make it clearer to new backend implementers what they actually should/can implement.

So instead of implementing an encode() method, data store classes could have a list of default encoders (see xarray.coding) used when reading/writing data. But xarray.coding isn't quite ready for this yet...

Contributor:
Clearly some refactoring will be needed here once the overall backend refactoring moves forward. For now, however, this seems like a reasonable compromise.

Member (Author):
@shoyer - I'm a bit confused here. As you'll see in the Zarr backend, the encode_variable method is applying a list of encoders. Where in the WritableDataStore were you envisioning the application of the encoders?

Member (Author):
@shoyer - What would you say to merging this in its current state and leaving the encoders refactor to a separate PR? I'm happy to make more changes here, but a) I'm not sure how to address your last comment, and b) I've already drifted a fair ways off track with this PR.

Member:
I'm fine with that. We may want to change it more in the future, but this is a clear improvement for now.
"""
Encode the variables and attributes in this store

Parameters
----------
variables : dict-like
Dictionary of key/value (variable name / xr.Variable) pairs
attributes : dict-like
Dictionary of key/value (attribute name / attribute) pairs

Returns
-------
variables : dict-like
attributes : dict-like

"""
variables = OrderedDict([(k, self.encode_variable(v))
for k, v in variables.items()])
attributes = OrderedDict([(k, self.encode_attribute(v))
for k, v in attributes.items()])
return variables, attributes

def encode_variable(self, v):
"""encode one variable"""
return v

def encode_attribute(self, a):
"""encode one attribute"""
return a
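The default encode() above is an order-preserving per-item map over the two overridable hooks, encode_variable and encode_attribute. A minimal runnable sketch of that hook pattern (SketchDataStore and UpperCaseAttrs are hypothetical names for illustration, not part of xarray):

```python
from collections import OrderedDict

class SketchDataStore:
    """Hypothetical store mirroring the default encode() pipeline:
    an order-preserving per-item map over two overridable hooks."""

    def encode_variable(self, v):
        return v  # identity by default

    def encode_attribute(self, a):
        return a  # identity by default

    def encode(self, variables, attributes):
        variables = OrderedDict((k, self.encode_variable(v))
                                for k, v in variables.items())
        attributes = OrderedDict((k, self.encode_attribute(v))
                                 for k, v in attributes.items())
        return variables, attributes

class UpperCaseAttrs(SketchDataStore):
    # hypothetical subclass: backends override only the hooks they need
    def encode_attribute(self, a):
        return a.upper() if isinstance(a, str) else a

_, attrs = UpperCaseAttrs().encode({}, {"title": "my dataset"})
print(attrs["title"])  # MY DATASET
```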

def set_dimension(self, d, l): # pragma: no cover
raise NotImplementedError

@@ -208,24 +242,74 @@ def sync(self):
self.writer.sync()

def store_dataset(self, dataset):
# in stores variables are all variables AND coordinates
# in xarray.Dataset variables are variables NOT coordinates,
# so here we pass the whole dataset in instead of doing
# dataset.variables
"""
in stores, variables are all variables AND coordinates
in xarray.Dataset variables are variables NOT coordinates,
so here we pass the whole dataset in instead of doing
dataset.variables
"""
self.store(dataset, dataset.attrs)

def store(self, variables, attributes, check_encoding_set=frozenset(),
unlimited_dims=None):
"""
Top level method for putting data on this store. This method:
- encodes variables/attributes
- sets dimensions
- sets variables

Parameters
----------
variables : dict-like
Dictionary of key/value (variable name / xr.Variable) pairs
attributes : dict-like
Dictionary of key/value (attribute name / attribute) pairs
check_encoding_set : list-like
List of variables that should be checked for invalid encoding
values
unlimited_dims : list-like
List of dimension names that should be treated as unlimited
dimensions.
"""

variables, attributes = self.encode(variables, attributes)

self.set_attributes(attributes)
self.set_dimensions(variables, unlimited_dims=unlimited_dims)
self.set_variables(variables, check_encoding_set,
unlimited_dims=unlimited_dims)
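The store() docstring above describes a fixed pipeline: encode, then attributes, then dimensions, then variables. A small stub (CallOrderStore is hypothetical, not xarray code) that reimplements the same control flow just to make the ordering visible:

```python
class CallOrderStore:
    """Hypothetical stub reproducing store()'s control flow
    so the order of operations can be observed."""

    def __init__(self):
        self.calls = []

    def encode(self, variables, attributes):
        self.calls.append("encode")
        return variables, attributes

    def set_attributes(self, attributes):
        self.calls.append("set_attributes")

    def set_dimensions(self, variables, unlimited_dims=None):
        self.calls.append("set_dimensions")

    def set_variables(self, variables, check_encoding_set,
                      unlimited_dims=None):
        self.calls.append("set_variables")

    def store(self, variables, attributes, check_encoding_set=frozenset(),
              unlimited_dims=None):
        # same sequence as AbstractWritableDataStore.store in the diff
        variables, attributes = self.encode(variables, attributes)
        self.set_attributes(attributes)
        self.set_dimensions(variables, unlimited_dims=unlimited_dims)
        self.set_variables(variables, check_encoding_set,
                           unlimited_dims=unlimited_dims)

s = CallOrderStore()
s.store({}, {})
print(s.calls)  # ['encode', 'set_attributes', 'set_dimensions', 'set_variables']
```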

def set_attributes(self, attributes):
"""
This provides a centralized method to set the dataset attributes on the
data store.

Parameters
----------
attributes : dict-like
Dictionary of key/value (attribute name / attribute) pairs
"""
for k, v in iteritems(attributes):
self.set_attribute(k, v)

def set_variables(self, variables, check_encoding_set,
unlimited_dims=None):
"""
This provides a centralized method to set the variables on the data
store.

Parameters
----------
variables : dict-like
Dictionary of key/value (variable name / xr.Variable) pairs
check_encoding_set : list-like
List of variables that should be checked for invalid encoding
values
unlimited_dims : list-like
List of dimension names that should be treated as unlimited
dimensions.
"""

for vn, v in iteritems(variables):
name = _encode_variable_name(vn)
check = vn in check_encoding_set
@@ -234,23 +318,54 @@ def set_variables(self, check_encoding_set,

self.writer.add(source, target)

def set_necessary_dimensions(self, variable, unlimited_dims=None):
def set_dimensions(self, variables, unlimited_dims=None):
Contributor:
Can you add a docstring or comment explaining what this method does?

This would help new developers (including myself) come onboard with backend development.
"""
This provides a centralized method to set the dimensions on the data
store.

Parameters
----------
variables : dict-like
Dictionary of key/value (variable name / xr.Variable) pairs
unlimited_dims : list-like
List of dimension names that should be treated as unlimited
dimensions.
"""
if unlimited_dims is None:
unlimited_dims = set()
dims = self.get_dimensions()
for d, l in zip(variable.dims, variable.shape):
if d not in dims:

existing_dims = self.get_dimensions()

dims = OrderedDict()
for v in unlimited_dims: # put unlimited_dims first
dims[v] = None
for v in variables.values():
dims.update(dict(zip(v.dims, v.shape)))

for d, l in dims.items():

if d in existing_dims and l != existing_dims[d]:
raise ValueError("Unable to update size for existing dimension "
"%r (%d != %d)" % (d, l, existing_dims[d]))
elif d not in existing_dims:
is_unlimited = d in unlimited_dims
self.set_dimension(d, l, is_unlimited)
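set_dimensions() above merges dimension sizes from every variable (unlimited dimensions first, with size None until a variable supplies one) and refuses to resize a dimension the store already holds. A standalone sketch of that merge logic, using plain (dims, shape) tuples in place of xr.Variable (collect_dimensions is a hypothetical helper):

```python
from collections import OrderedDict

def collect_dimensions(variables, existing_dims, unlimited_dims=()):
    """Hypothetical helper following set_dimensions(): gather every
    dimension used by `variables` (unlimited dims listed first) and
    reject size conflicts with dimensions the store already holds."""
    dims = OrderedDict((d, None) for d in unlimited_dims)
    for names, shape in variables.values():  # (dims, shape), like xr.Variable
        dims.update(dict(zip(names, shape)))
    for d, l in dims.items():
        if d in existing_dims and l != existing_dims[d]:
            raise ValueError("Unable to update size for existing dimension "
                             "%r (%d != %d)" % (d, l, existing_dims[d]))
    return dims

dims = collect_dimensions({"t2m": (("time", "lat"), (4, 90))},
                          existing_dims={"lat": 90},
                          unlimited_dims=["time"])
print(dict(dims))  # {'time': 4, 'lat': 90}
```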


class WritableCFDataStore(AbstractWritableDataStore):

def store(self, variables, attributes, *args, **kwargs):
def encode(self, variables, attributes):
# All NetCDF files get CF encoded by default; without this, attempting
# to write times, for example, would fail.
cf_variables, cf_attrs = cf_encoder(variables, attributes)
AbstractWritableDataStore.store(self, cf_variables, cf_attrs,
variables, attributes = cf_encoder(variables, attributes)
variables = OrderedDict([(k, self.encode_variable(v))
for k, v in variables.items()])
attributes = OrderedDict([(k, self.encode_attribute(v))
for k, v in attributes.items()])
return variables, attributes
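WritableCFDataStore.encode() above layers CF encoding in front of the per-item hooks inherited from the base class. A sketch of that layering with a stand-in for cf_encoder (fake_cf_encoder and SketchCFStore are hypothetical names, not xarray's API):

```python
from collections import OrderedDict

def fake_cf_encoder(variables, attributes):
    # stand-in for xarray's cf_encoder: tags attributes so the
    # ordering of the two encoding stages is visible
    attributes = {k: ("cf:" + v if isinstance(v, str) else v)
                  for k, v in attributes.items()}
    return variables, attributes

class SketchCFStore:
    """Hypothetical store: CF encoding runs first, then the
    per-item encode_variable/encode_attribute hooks."""

    def encode_variable(self, v):
        return v

    def encode_attribute(self, a):
        return a.upper() if isinstance(a, str) else a

    def encode(self, variables, attributes):
        variables, attributes = fake_cf_encoder(variables, attributes)
        variables = OrderedDict((k, self.encode_variable(v))
                                for k, v in variables.items())
        attributes = OrderedDict((k, self.encode_attribute(v))
                                 for k, v in attributes.items())
        return variables, attributes

_, attrs = SketchCFStore().encode({}, {"units": "days"})
print(attrs["units"])  # CF:DAYS
```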

def store(self, variables, attributes, *args, **kwargs):
AbstractWritableDataStore.store(self, variables, attributes,
*args, **kwargs)

Member:
You can remove this method -- the implementation is exactly the parent class method.


9 changes: 5 additions & 4 deletions xarray/backends/h5netcdf_.py
@@ -9,7 +9,7 @@
from ..core.pycompat import iteritems, bytes_type, unicode_type, OrderedDict

from .common import WritableCFDataStore, DataStorePickleMixin, find_root
from .netCDF4_ import (_nc4_group, _nc4_values_and_dtype,
from .netCDF4_ import (_nc4_group, _encode_nc4_variable, _get_datatype,
_extract_nc4_variable_encoding, BaseNetCDF4Array)


@@ -126,14 +126,15 @@ def set_attribute(self, key, value):
with self.ensure_open(autoclose=False):
self.ds.setncattr(key, value)

def encode_variable(self, variable):
return _encode_nc4_variable(variable)

def prepare_variable(self, name, variable, check_encoding=False,
unlimited_dims=None):
import h5py

attrs = variable.attrs.copy()
variable, dtype = _nc4_values_and_dtype(variable)

self.set_necessary_dimensions(variable, unlimited_dims=unlimited_dims)
dtype = _get_datatype(variable)

fill_value = attrs.pop('_FillValue', None)
if dtype is str and fill_value is not None:
9 changes: 8 additions & 1 deletion xarray/backends/memory.py
@@ -29,6 +29,13 @@ def get_attrs(self):
def get_variables(self):
return self._variables

def get_dimensions(self):
dims = OrderedDict()
for v in self._variables.values():
for d, s in zip(v.dims, v.shape):
dims[d] = s
return dims
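get_dimensions() above derives the store's dimension map from its variables, each contributing (dimension name, size) pairs. The same derivation with a plain stand-in for xr.Variable (FakeVariable is hypothetical):

```python
from collections import OrderedDict

class FakeVariable:
    # hypothetical stand-in for xr.Variable: just .dims and .shape
    def __init__(self, dims, shape):
        self.dims, self.shape = dims, shape

variables = {"a": FakeVariable(("x", "y"), (3, 4)),
             "b": FakeVariable(("y",), (4,))}

# each variable contributes (dimension name, size) pairs
dims = OrderedDict()
for v in variables.values():
    for d, s in zip(v.dims, v.shape):
        dims[d] = s
print(dict(dims))  # {'x': 3, 'y': 4}
```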

def prepare_variable(self, k, v, *args, **kwargs):
new_var = Variable(v.dims, np.empty_like(v), v.attrs)
# we copy the variable and stuff all encodings in the
@@ -41,6 +48,6 @@ def set_attribute(self, k, v):
# copy to imitate writing to disk.
self._attributes[k] = copy.deepcopy(v)

def set_dimension(self, d, l):
def set_dimension(self, d, l, unlimited_dims=None):
# in this model, dimensions are accounted for in the variables
pass
38 changes: 23 additions & 15 deletions xarray/backends/netCDF4_.py
@@ -74,19 +74,28 @@ def __getitem__(self, key):
return data


def _nc4_values_and_dtype(var):
def _encode_nc4_variable(var):
if var.dtype.kind == 'S':
var = conventions.maybe_encode_as_char_array(var)
return var


def _get_datatype(var, nc_format='NETCDF4'):
if nc_format == 'NETCDF4':
datatype = _nc4_dtype(var)
else:
datatype = var.dtype
return datatype


def _nc4_dtype(var):
if var.dtype.kind == 'U':
dtype = str
elif var.dtype.kind == 'S':
# use character arrays instead of unicode, because unicode support in
# netCDF4 is still rather buggy
var = conventions.maybe_encode_as_char_array(var)
dtype = var.dtype
elif var.dtype.kind in ['i', 'u', 'f', 'c']:
elif var.dtype.kind in ['i', 'u', 'f', 'c', 'S']:
dtype = var.dtype
else:
raise ValueError('cannot infer dtype for netCDF4 variable')
return var, dtype
return dtype
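_nc4_dtype above maps numpy dtypes onto what netCDF4 can store: unicode arrays become variable-length str, while numeric and fixed-width bytes dtypes pass through. A standalone sketch of that mapping (nc4_dtype_sketch is a hypothetical name):

```python
import numpy as np

def nc4_dtype_sketch(dtype):
    """Hypothetical standalone version of _nc4_dtype's mapping."""
    if dtype.kind == 'U':
        # unicode arrays become NetCDF4 variable-length strings
        return str
    elif dtype.kind in ['i', 'u', 'f', 'c', 'S']:
        # numeric and fixed-width bytes dtypes pass through unchanged
        return dtype
    raise ValueError('cannot infer dtype for netCDF4 variable')

print(nc4_dtype_sketch(np.dtype('U10')))      # <class 'str'>
print(nc4_dtype_sketch(np.dtype('float64')))  # float64
```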


def _nc4_group(ds, group, mode):
@@ -324,18 +333,17 @@ def set_variables(self, *args, **kwargs):
with self.ensure_open(autoclose=False):
super(NetCDF4DataStore, self).set_variables(*args, **kwargs)

def prepare_variable(self, name, variable, check_encoding=False,
unlimited_dims=None):
def encode_variable(self, variable):
variable = _force_native_endianness(variable)

if self.format == 'NETCDF4':
variable, datatype = _nc4_values_and_dtype(variable)
variable = _encode_nc4_variable(variable)
else:
variable = encode_nc3_variable(variable)
datatype = variable.dtype

self.set_necessary_dimensions(variable, unlimited_dims=unlimited_dims)
return variable

def prepare_variable(self, name, variable, check_encoding=False,
unlimited_dims=None):
datatype = _get_datatype(variable, self.format)
attrs = variable.attrs.copy()

fill_value = attrs.pop('_FillValue', None)
9 changes: 4 additions & 5 deletions xarray/backends/scipy_.py
@@ -181,17 +181,16 @@ def set_attribute(self, key, value):
value = encode_nc3_attr_value(value)
setattr(self.ds, key, value)

def encode_variable(self, variable):
variable = encode_nc3_variable(variable)
return variable

def prepare_variable(self, name, variable, check_encoding=False,
unlimited_dims=None):
variable = encode_nc3_variable(variable)
if check_encoding and variable.encoding:
raise ValueError('unexpected encoding for scipy backend: %r'
% list(variable.encoding))

if unlimited_dims is not None and len(unlimited_dims) > 1:
raise ValueError('NETCDF3 only supports one unlimited dimension')
self.set_necessary_dimensions(variable, unlimited_dims=unlimited_dims)

data = variable.data
# nb. this still creates a numpy array in all memory, even though we
# don't write the data yet; scipy.io.netcdf does not support