WIP: Performance improvements for zarr backend #1800

Merged
merged 38 commits, Jan 24, 2018
Commits
38 commits
a0bea98
move backend append logic to the prepare_variable methods
Dec 21, 2017
afdb254
deprecate variables/dimensions/attrs properties on AbstractWritableDa…
Dec 22, 2017
cc02150
warnings instead of errors for backend properties
Dec 24, 2017
86240cd
use attrs.update when setting zarr attributes
Dec 26, 2017
9c89ef2
more performance improvements to attributes in zarr backend
Dec 26, 2017
2568d21
Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a…
Dec 28, 2017
d459c66
fix typo
Dec 28, 2017
c59ca57
Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a…
Dec 28, 2017
2dd186a
Merge branch 'fix/zarr_set_attrs' of github.com:jhamman/xarray into f…
Dec 28, 2017
8f71b31
new set_dimensions method for writable data stores
Jan 2, 2018
07b9c21
Merge branch 'fix/zarr_set_attrs' of github.com:jhamman/xarray into f…
Jan 2, 2018
67fcd92
more fixes for zarr
Jan 2, 2018
b38e1a6
more tests for zarr and remove append logic for zarr
Jan 2, 2018
47ba8b6
more tests for zarr and remove append logic for zarr
Jan 2, 2018
9152b12
Merge branch 'fix/zarr_set_attrs' of github.com:jhamman/xarray into f…
Jan 2, 2018
26b6bcb
a few more tweaks to zarr attrs
Jan 2, 2018
b7681ae
Add encode methods to writable data stores, fixes for Zarr tests
Jan 4, 2018
e084e9e
fix for InMemoryDataStore
Jan 5, 2018
a6aeb36
fix for unlimited dimensions Scipy Datastores
Jan 5, 2018
264b13f
another patch for scipy
Jan 5, 2018
9c03bfc
whatsnew
Jan 6, 2018
c92020a
ordereddict
Jan 7, 2018
18434f9
address some of rabernats comments, in particular, this commit remove…
Jan 9, 2018
9f89c7c
stop skipping zero-dim zarr tests
Jan 9, 2018
3590d28
update minimum zarr version for tests
Jan 9, 2018
69cacee
Merge branch 'master' into fix/zarr_set_attrs
Jan 9, 2018
8d744e0
Merge branch 'master' into fix/zarr_set_attrs
Jan 9, 2018
a8dabdf
Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a…
Jan 10, 2018
7858db7
Merge branch 'fix/zarr_set_attrs' of github.com:jhamman/xarray into f…
Jan 10, 2018
48bf7ef
Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a…
Jan 11, 2018
53260c9
Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a…
Jan 12, 2018
7ed6bf8
cleanup and docs for zarr performance branch
Jan 13, 2018
3872da2
fix two failing tests when using zarr master
Jan 14, 2018
e6b7068
Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a…
Jan 14, 2018
c31decf
flake8
Jan 15, 2018
189d262
back to zarr 2.2
Jan 15, 2018
07b92e2
Merge branch 'master' of github.com:pydata/xarray into fix/zarr_set_a…
Jan 15, 2018
96996ef
remove extra store method
Jan 16, 2018
26 changes: 20 additions & 6 deletions xarray/backends/common.py
@@ -6,10 +6,10 @@
import time
import traceback
import contextlib
from collections import Mapping
from collections import Mapping, OrderedDict
import warnings

from ..conventions import cf_encoder
from ..conventions import cf_encoder, maybe_encode_as_char_array
from ..core import indexing
from ..core.utils import FrozenOrderedDict, NdimSizeLenMixin
from ..core.pycompat import iteritems, dask_array_type
@@ -216,7 +216,11 @@ def store_dataset(self, dataset):

def store(self, variables, attributes, check_encoding_set=frozenset(),
unlimited_dims=None):
# This seems out of place
variables = OrderedDict([(k, maybe_encode_as_char_array(v))
for k, v in variables.items()])
Member Author:
Should each backend define encode/decode methods? For string variables, we need to encode before setting the dimensions.

self.set_attributes(attributes)
self.set_dimensions(variables, unlimited_dims=unlimited_dims)
self.set_variables(variables, check_encoding_set,
unlimited_dims=unlimited_dims)

@@ -234,12 +238,22 @@ def set_variables(self, variables, check_encoding_set,

self.writer.add(source, target)

def set_necessary_dimensions(self, variable, unlimited_dims=None):
def set_dimensions(self, variables, unlimited_dims=None):
Contributor:
Can you add a docstring or comment explaining what this method does?

This would help new developers (including myself) come onboard with backend development.

if unlimited_dims is None:
unlimited_dims = set()
dims = self.get_dimensions()
for d, l in zip(variable.dims, variable.shape):
if d not in dims:

existing_dims = self.get_dimensions()

dims = {}
for v in variables.values():
dims.update(dict(zip(v.dims, v.shape)))

for d, l in dims.items():

if d in existing_dims and l != existing_dims[d]:
raise ValueError("Unable to update size for existing dimension "
"%r (%d != %d)" % (d, l, existing_dims[d]))
elif d not in existing_dims:
is_unlimited = d in unlimited_dims
self.set_dimension(d, l, is_unlimited)

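The consolidated `set_dimensions` logic in the hunk above can be sketched as a standalone method on a minimal store. `DummyStore` and `Var` are illustrative stand-ins (plain dicts and a namedtuple instead of a real backend and xarray `Variable`), not part of xarray:

```python
from collections import OrderedDict, namedtuple

# Minimal stand-in for an xarray Variable: just dims and shape.
Var = namedtuple("Var", ["dims", "shape"])


class DummyStore:
    """Illustrative writable store exposing the new set_dimensions contract."""

    def __init__(self):
        self._dims = OrderedDict()

    def get_dimensions(self):
        return dict(self._dims)

    def set_dimension(self, name, length, is_unlimited=False):
        self._dims[name] = length

    def set_dimensions(self, variables, unlimited_dims=None):
        # Gather every dimension across all variables first, then create
        # only the missing ones, rejecting size conflicts up front.
        if unlimited_dims is None:
            unlimited_dims = set()
        existing_dims = self.get_dimensions()
        dims = OrderedDict()
        for v in variables.values():
            dims.update(dict(zip(v.dims, v.shape)))
        for d, l in dims.items():
            if d in existing_dims and l != existing_dims[d]:
                raise ValueError(
                    "Unable to update size for existing dimension "
                    "%r (%d != %d)" % (d, l, existing_dims[d]))
            elif d not in existing_dims:
                self.set_dimension(d, l, d in unlimited_dims)
```

Calling `set_dimensions` a second time with a changed size raises, which is the behavior the new `test_append_with_invalid_dim_raises` test exercises.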
2 changes: 0 additions & 2 deletions xarray/backends/h5netcdf_.py
@@ -133,8 +133,6 @@ def prepare_variable(self, name, variable, check_encoding=False,
attrs = variable.attrs.copy()
variable, dtype = _nc4_values_and_dtype(variable)

self.set_necessary_dimensions(variable, unlimited_dims=unlimited_dims)

fill_value = attrs.pop('_FillValue', None)
if dtype is str and fill_value is not None:
raise NotImplementedError(
2 changes: 0 additions & 2 deletions xarray/backends/netCDF4_.py
@@ -334,8 +334,6 @@ def prepare_variable(self, name, variable, check_encoding=False,
variable = encode_nc3_variable(variable)
datatype = variable.dtype

self.set_necessary_dimensions(variable, unlimited_dims=unlimited_dims)

attrs = variable.attrs.copy()

fill_value = attrs.pop('_FillValue', None)
4 changes: 0 additions & 4 deletions xarray/backends/scipy_.py
@@ -188,10 +188,6 @@ def prepare_variable(self, name, variable, check_encoding=False,
raise ValueError('unexpected encoding for scipy backend: %r'
% list(variable.encoding))

if unlimited_dims is not None and len(unlimited_dims) > 1:
raise ValueError('NETCDF3 only supports one unlimited dimension')
self.set_necessary_dimensions(variable, unlimited_dims=unlimited_dims)

data = variable.data
# nb. this still creates a numpy array in all memory, even though we
# don't write the data yet; scipy.io.netcdf does not support
105 changes: 29 additions & 76 deletions xarray/backends/zarr.py
@@ -43,13 +43,8 @@ def _ensure_valid_fill_value(value, dtype):
return _encode_zarr_attr_value(valid)


def _decode_zarr_attr_value(value):
return value


def _decode_zarr_attrs(attrs):
return OrderedDict([(k, _decode_zarr_attr_value(v))
for k, v in attrs.items()])
return OrderedDict(attrs.asdict())


def _replace_slices_with_arrays(key, shape):
@@ -297,9 +292,6 @@ def __init__(self, zarr_group, writer=None):
raise KeyError("Zarr group can't be read by xarray because "
"it is missing the `%s` attribute." %
_DIMENSION_KEY)
else:
# initialize hidden dimension attribute
self.ds.attrs[_DIMENSION_KEY] = {}
Contributor:
I think the removal of these lines is relevant for the dimension key test error.


if writer is None:
# by default, we should not need a lock for writing zarr because
Expand All @@ -315,7 +307,7 @@ def open_store_variable(self, name, zarr_array):
data = indexing.LazilyIndexedArray(ZarrArrayWrapper(name, self))
dimensions, attributes = _get_zarr_dims_and_attrs(zarr_array,
_DIMENSION_KEY)
attributes = _decode_zarr_attrs(attributes)
attributes = _decode_zarr_attrs(attributes.asdict())
encoding = {'chunks': zarr_array.chunks,
'compressor': zarr_array.compressor,
'filters': zarr_array.filters}
@@ -331,29 +323,33 @@ def get_variables(self):
for k, v in self.ds.arrays())

def get_attrs(self):
_, attributes = _get_zarr_dims_and_attrs(self.ds, _DIMENSION_KEY)
attributes = HiddenKeyDict(self.ds.attrs.asdict(), [_DIMENSION_KEY])
return _decode_zarr_attrs(attributes)

def get_dimensions(self):
dimensions, _ = _get_zarr_dims_and_attrs(self.ds, _DIMENSION_KEY)
try:
dimensions = self.ds.attrs[_DIMENSION_KEY].asdict()
except KeyError:
raise KeyError("Zarr object is missing the attribute `%s`, which "
"is required for xarray to determine variable "
"dimensions." % (_DIMENSION_KEY))
Contributor:
This should be raising when you delete the dimension key.

return dimensions

def set_dimension(self, name, length, is_unlimited=False):
if is_unlimited:
def set_dimensions(self, variables, unlimited_dims=None):
if unlimited_dims is not None:
raise NotImplementedError(
"Zarr backend doesn't know how to handle unlimited dimensions")
# consistency check
if name in self.ds.attrs[_DIMENSION_KEY]:
if self.ds.attrs[_DIMENSION_KEY][name] != length:
raise ValueError("Pre-existing array dimensions %r "
"encoded in Zarr attributes are incompatible "
"with newly specified dimension `%s`: %g" %
(self.ds.attrs[_DIMENSION_KEY], name, length))
self.ds.attrs[_DIMENSION_KEY][name] = length

def set_attribute(self, key, value):
_, attributes = _get_zarr_dims_and_attrs(self.ds, _DIMENSION_KEY)
attributes[key] = _encode_zarr_attr_value(value)

dims = {}
Member Author:
make this an OrderedDict

for v in variables.values():
dims.update(dict(zip(v.dims, v.shape)))

self.ds.attrs.update({_DIMENSION_KEY: dims})

def set_attributes(self, attributes):
encoded_attrs = OrderedDict((k, _encode_zarr_attr_value(v))
for k, v in iteritems(attributes))
self.ds.attrs.put(encoded_attrs)

def prepare_variable(self, name, variable, check_encoding=False,
unlimited_dims=None):
@@ -363,41 +359,21 @@ def prepare_variable(self, name, variable, check_encoding=False,
dtype = variable.dtype
shape = variable.shape

# TODO: figure out how zarr should deal with unlimited dimensions
self.set_necessary_dimensions(variable, unlimited_dims=unlimited_dims)

fill_value = _ensure_valid_fill_value(attrs.pop('_FillValue', None),
dtype)

# TODO: figure out what encoding is needed for zarr
encoding = _extract_zarr_variable_encoding(
variable, raise_on_invalid=check_encoding)

# arguments for zarr.create:
# zarr.creation.create(shape, chunks=None, dtype=None,
# compressor='default', fill_value=0, order='C', store=None,
# synchronizer=None, overwrite=False, path=None, chunk_store=None,
# filters=None, cache_metadata=True, **kwargs)
if name in self.ds:
zarr_array = self.ds[name]
else:
zarr_array = self.ds.create(name, shape=shape, dtype=dtype,
fill_value=fill_value, **encoding)
# decided not to explicitly enumerate encoding options because we
# risk overriding zarr's defaults (e.g. if we specify
# cache_metadata=None instead of True). Alternative is to have lots of
# logic in _extract_zarr_variable encoding to duplicate zarr defaults.
# chunks=encoding.get('chunks'),
# compressor=encoding.get('compressor'),
# filters=encodings.get('filters'),
# cache_metadata=encoding.get('cache_metadata'))

encoded_attrs = OrderedDict()
# the magic for storing the hidden dimension data
zarr_array.attrs[_DIMENSION_KEY] = dims
_, attributes = _get_zarr_dims_and_attrs(zarr_array, _DIMENSION_KEY)

encoded_attrs[_DIMENSION_KEY] = dims
for k, v in iteritems(attrs):
attributes[k] = _encode_zarr_attr_value(v)
encoded_attrs[k] = _encode_zarr_attr_value(v)
Contributor:
maybe leave this as a #TODO?


zarr_array = self.ds.create(name, shape=shape, dtype=dtype,
fill_value=fill_value, **encoding)
zarr_array.attrs.put(encoded_attrs)

return zarr_array, variable.data

@@ -406,29 +382,6 @@ def store(self, variables, attributes, *args, **kwargs):
for k, v in iteritems(variables))
AbstractWritableDataStore.store(self, new_vars, attributes,
*args, **kwargs)
# sync() and close() methods should not be needed with zarr


# from zarr docs

# Zarr arrays can be used as either the source or sink for data in parallel
# computations. Both multi-threaded and multi-process parallelism are
# supported. The Python global interpreter lock (GIL) is released for both
# compression and decompression operations, so Zarr will not block other Python
# threads from running.
#
# A Zarr array can be read concurrently by multiple threads or processes. No
# synchronization (i.e., locking) is required for concurrent reads.
#
# A Zarr array can also be written to concurrently by multiple threads or
# processes. Some synchronization may be required, depending on the way the
# data is being written.

# If each worker in a parallel computation is writing to a separate region of
# the array, and if region boundaries are perfectly aligned with chunk
# boundaries, then no synchronization is required. However, if region and chunk
# boundaries are not perfectly aligned, then synchronization is required to
# avoid two workers attempting to modify the same chunk at the same time.


def open_zarr(store, group=None, synchronizer=None, auto_chunk=True,
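The zarr hunks above lean on two ideas worth seeing in isolation: dimension names live in a bookkeeping attribute on the group, and a `HiddenKeyDict` hides that key from user-facing attrs. A plain-dict sketch, with no zarr dependency — the key string and the class body here are simplified illustrations, and the real `_DIMENSION_KEY` value is defined elsewhere in `zarr.py`:

```python
DIMENSION_KEY = "_XARRAY_DIMENSIONS"  # placeholder name for _DIMENSION_KEY


class HiddenKeyDict:
    """Read-only dict view that hides bookkeeping keys from callers."""

    def __init__(self, data, hidden_keys):
        self._data = data
        self._hidden = set(hidden_keys)

    def __getitem__(self, key):
        if key in self._hidden:
            raise KeyError(key)
        return self._data[key]

    def items(self):
        return [(k, v) for k, v in self._data.items()
                if k not in self._hidden]


# Group attrs hold both user metadata and the hidden dimension map.
group_attrs = {DIMENSION_KEY: {"x": 2, "y": 3}, "title": "demo"}

# get_attrs(): the user-facing view excludes the dimension bookkeeping.
visible = HiddenKeyDict(group_attrs, [DIMENSION_KEY])

# get_dimensions(): read the hidden key directly, failing loudly if absent.
dimensions = group_attrs[DIMENSION_KEY]
```

Deleting `DIMENSION_KEY` from `group_attrs` makes the `get_dimensions` lookup raise `KeyError`, which matches the Contributor's observation about the dimension-key test error above.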
10 changes: 10 additions & 0 deletions xarray/tests/test_backends.py
@@ -725,6 +725,16 @@ def test_append_overwrite_values(self):
with self.open(tmp_file) as actual:
self.assertDatasetIdentical(data, actual)

def test_append_with_invalid_dim_raises(self):
data = create_test_data()
with create_tmp_file(allow_cleanup_failure=False) as tmp_file:
self.save(data, tmp_file, mode='w')
data['var9'] = data['var2'] * 3
data = data.isel(dim1=slice(2, 6)) # modify one dimension
with raises_regex(ValueError,
'Unable to update size for existing dimension'):
self.save(data, tmp_file, mode='a')

def test_vectorized_indexing(self):
self._test_vectorized_indexing(vindex_support=False)

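Beyond the structural changes, the main zarr speedup in this PR is batching attribute writes: attributes are encoded first and then written with a single `attrs.put(...)` call instead of one store round-trip per key. A sketch of the pattern — `FakeAttrs` and its write counter are illustrative, and the encoder is a simplified stand-in for `_encode_zarr_attr_value` (the real one also handles numpy scalars and arrays):

```python
from collections import OrderedDict


def encode_attr_value(value):
    # zarr attrs must be JSON-serializable; convert tuples to lists here
    # (a real encoder also converts numpy types, per the diff above).
    if isinstance(value, (list, tuple)):
        return [encode_attr_value(v) for v in value]
    return value


class FakeAttrs:
    """Stand-in for zarr's attribute store; put() is one metadata write."""

    def __init__(self):
        self.data = {}
        self.write_calls = 0

    def put(self, d):
        self.write_calls += 1  # one write, however many keys
        self.data = dict(d)

    def __setitem__(self, key, value):
        self.write_calls += 1  # per-key writes cost one round-trip each
        self.data[key] = value


attrs = FakeAttrs()
raw = OrderedDict([("units", "m"), ("long_name", "height"), ("valid", (0, 1))])

# Old pattern: attrs[k] = v per key -> len(raw) writes.
# New pattern: encode everything, then a single put.
encoded = OrderedDict((k, encode_attr_value(v)) for k, v in raw.items())
attrs.put(encoded)
```

With the batched pattern the store sees one write for the whole mapping, which is why `set_attributes` and `prepare_variable` in the diff both build `encoded_attrs` first and call `put` once.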