WIP: Zarr backend #1528
Conversation
Sorry that I let this slide - there was not a huge upswell of interest around what I had done, and I was not ready to dive into xarray internals. As in this comment, I have come to the realisation that although nice to/from zarr methods can be made relatively easily, they will not get traction unless they can be put within a class that mimics the existing xarray infrastructure, i.e., the user would never know, except that magically they have extra encoding/compression options, the file path can be an S3 URL (say), and dask parallel computation suddenly works on a cluster and/or with out-of-core processing.
Your functions are a great proof of concept for the relative ease of interoperability between xarray and zarr. What I have done here is to implement an xarray "backend" (i.e. DataStore) that uses zarr as its storage medium. This puts zarr on the same level as netCDF and HDF5 as a "first class" storage format for xarray data, as suggested by @shoyer in the comment on that thread. My hope is that this will enable the magical performance benefits that you have anticipated. Digging deeper into that thread, I see @shoyer makes the following proposition:
With this PR, I have started to do the former (write a DataStore). However, I can already see the wisdom of what he says next:
I have already implemented my own custom DataStore for a different project, so I felt comfortable diving into this. But I might end up reinventing the wheel several times over if I continue down this road. In particular, I can see that my […]. On the other hand, zarr is so simple to use that a separate wrapper package might be overkill. So I am still not sure whether the approach I am taking here is worth pursuing further. I consider this a highly experimental PR, and I'm really looking for feedback.
This is also part of my goal. I think all the metadata can be stored internally to zarr via attributes. There just have to be some "special" attributes that xarray hides from the user. This is the same as h5netcdf. @alimanfoo suggested this should be possible in that earlier thread:
@rabernat: on actually looking through your code :) Happy to see you doing exactly what I felt I was not knowledgeable enough to do and poking at xarray's guts. If I can help in any way, please let me know, although I don't have a lot of spare hours right now.
This is slightly easier than I thought, to be honest :). Given the simplicity of the spec on top of zarr (just adding dimensions), we probably don't need the separate wrapper -- we should just describe it well in the docs.
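For illustration, a minimal sketch of that "spec on top of zarr": plain zarr plus one special attribute naming the dimensions. The attribute name follows the `'_ARRAY_DIMENSIONS'` suggestion later in this thread; everything else here is hypothetical.

```python
import numpy as np
import zarr

# an in-memory group, purely for illustration
group = zarr.group(store={})
arr = group.create_dataset('temperature', data=np.zeros((180, 360)),
                           chunks=(90, 90))
# the only metadata xarray needs beyond plain zarr: named dimensions
arr.attrs['_ARRAY_DIMENSIONS'] = ['lat', 'lon']
```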
xarray/backends/zarr.py
Outdated
# the first question is whether it should be based on BaseNetCDF4Array or
# NdimSizeLenMixing?

# or maybe we don't need wrappers at all? probably not true
I actually think we probably don't need a wrapper at all -- zarr already defines all these attributes!
This time around I did add the wrapper.
xarray/backends/zarr.py
Outdated
first_chunk = all_chunks.next()
for this_chunk in all_chunks:
    if not (this_chunk == first_chunk):
        raise ValueError("zarr requires uniform chunk sizes, found %s" %
Calling `rechunk()` to make chunks uniform might be more user friendly here.
Note that zarr does allow chunks that overlap the edge of the array (i.e., the last chunk of a dask array). This use case might be important when storing arrays with unusual dimension sizes (e.g., prime numbers).
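A hedged sketch of what that check plus the suggested `rechunk()` fallback could look like (function and variable names are mine, not the PR's):

```python
import dask.array as da

def ensure_uniform_chunks(arr):
    """Rechunk a dask array so that, along each axis, all chunks except
    possibly the last have the same size, as zarr requires."""
    new_chunks = []
    for dim_chunks in arr.chunks:
        if any(c != dim_chunks[0] for c in dim_chunks[:-1]):
            # non-uniform: fall back to one uniform size (a heuristic)
            new_chunks.append(max(dim_chunks))
        else:
            # already uniform, apart from a possibly smaller edge chunk
            new_chunks.append(dim_chunks[0])
    return arr.rechunk(tuple(new_chunks))

x = da.ones((10, 10), chunks=((3, 4, 3), (5, 5)))
print(ensure_uniform_chunks(x).chunks)  # ((4, 4, 2), (5, 5))
```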
xarray/core/utils.py
Outdated
        yield k

    def __len__(self):
        return len(list(self.__iter__()))
I would certainly try to use `len(self._data)` here rather than iteration so this is still constant time (in practice it probably doesn't matter, though).
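For concreteness, a constant-time version along those lines (this assumes the wrapped dict is `self._data` and the hidden keys are `self._hidden_keys`, which may not match the PR's exact attribute names):

```python
def __len__(self):
    # constant time in the number of hidden keys, no full iteration
    num_hidden = sum(1 for k in self._hidden_keys if k in self._data)
    return len(self._data) - num_hidden
```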
xarray/core/utils.py
Outdated
    Acts like a normal dictionary, but hides certain keys.
    '''
    # ``__init__`` method required to create instance from class.
    def __init__(self, data, *hidden_keys):
nit: I prefer avoiding `*args` -- it gives more freedom to adjust APIs later (e.g., by adding keyword arguments).
Can you suggest the best way to tell whether an argument is a string or list of strings? This is something I always need to do but don't know the "correct" pythonic way to do it.
If you know it is an iterable, `isinstance(var, basestring)` should do.
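A small sketch of the usual idiom, spelled for the Python 2/3 straddle of this era (the helper name is mine):

```python
try:
    string_types = (str, unicode)  # noqa: F821 -- Python 2
except NameError:
    string_types = (str,)          # Python 3

def as_key_list(keys):
    """Accept a single string or an iterable of strings; return a list."""
    if isinstance(keys, string_types):
        return [keys]
    return list(keys)

as_key_list('_ARRAY_DIMENSIONS')            # -> ['_ARRAY_DIMENSIONS']
as_key_list(['_ARRAY_DIMENSIONS', '_FOO'])  # unchanged, as a list
```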
xarray/backends/zarr.py
Outdated
""" | ||
|
||
# need some special secret attributes to tell us the dimensions | ||
_dimension_key = '_XARRAY_DIMENSIONS' |
nit: should be `_DIMENSION_KEY` since it's a constant. Also: maybe better to pick something more generic for the constant value, perhaps `'_ARRAY_DIMENSIONS'`?
xarray/backends/zarr.py
Outdated
    def __init__(self, store=None, overwrite=False, chunk_store=None,
                 synchronizer=None, path=None, writer=None, autoclose=False):
        opener = functools.partial(_open_zarr_group, store, overwrite,
Let's try to follow something closer to the model for `NetCDFDataStore` that I suggest over in #1508:

- an `open` classmethod constructs the backend object for typical use cases (e.g., from a file)
- `__init__` just wraps an existing zarr group

This preserves a little bit more flexibility for downstream users.
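A rough sketch of that split, with zarr in place of netCDF (class and parameter names are illustrative, not this PR's actual code):

```python
import zarr

class ZarrStore(object):
    def __init__(self, zarr_group):
        # __init__ only wraps an existing zarr group...
        self.ds = zarr_group

    @classmethod
    def open_group(cls, store, mode='a', synchronizer=None, path=None):
        # ...while the classmethod handles the common "open from a
        # store/path" case, keeping __init__ flexible for downstream users
        group = zarr.open_group(store, mode=mode,
                                synchronizer=synchronizer, path=path)
        return cls(group)
```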
Yes, probably, if we want to handle netcdf conventions for times, fill values and scaling.
This would be nice! But it's also a bigger issue (will look for the number, I think it's already been opened).
Still need to think about this one.
I guess we can ignore them (maybe add a warning?) -- they're not part of the zarr data model.
I don't think we need any autoclose logic at all -- zarr doesn't leave open files hanging around already.
Is the goal here to be able to round-trip the file, such that calling […] reproduces the original? I don't understand how encoding interacts with attributes. When is something an attribute vs. an encoding […]?
Does this mean that my […]? Regarding encoding, zarr has its own internal mechanism for encoding, which it calls "filters", and these closely resemble some of the CF encoding options (for example, the […]). I don't yet understand how to make these elements work together properly, for example, to avoid applying the scale/offset function twice, as I mentioned above.
I am now trying to understand the backend test suite structure. Can someone explain to me why so many tests are skipped? For example, if I run […] I get […]. Those line numbers refer to all of the skipped methods. Why should I need pynio to run those tests? It looks like the same thing is happening on travis: https://travis-ci.org/pydata/xarray/jobs/268805771#L1527. Maybe @pwolfram understands this stuff?
Yes, exactly.
Typically, we store things in encoding that are attributes on the underlying NetCDF file, but no longer make sense to describe the decoded data. For example:
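(The original example list is elided. My illustration of the split, with a hypothetical packed netCDF file, based on xarray's CF conventions handling:)

```python
import xarray as xr

ds = xr.open_dataset('example.nc')  # hypothetical packed file
var = ds['temperature']
var.attrs      # user-facing metadata, e.g. long_name, standard_name
var.encoding   # on-disk representation details that no longer describe
               # the decoded data, e.g. {'scale_factor': 0.01,
               #   'add_offset': 273.15, '_FillValue': -9999,
               #   'dtype': 'int16'}
```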
Currently, we assume that stores never do this, and always handle it ourselves. We might need a special exception for zarr and scale/offset encoding.
Maybe, though again it will probably need slightly customized conventions for writing data (if we let zarr handle scale/offset encoding).
We have two options: […]
I think (2) would be the preferred way to do this.
Following this with interest. Regarding autoclose, just to confirm that zarr doesn't really have any notion of whether something is open or closed. When using the DirectoryStore storage class (most common use case I imagine), all files are automatically closed, nothing is kept open. There are some storage classes (e.g., ZipStore) that do require an explicit close call to finalise the file on disk if you have been writing data, but I think you can ignore this in xarray and leave it up to the user to manage this themselves.

Out of interest, @shoyer do you still think there would be value in writing a wrapper for zarr analogous to h5netcdf? Or does this PR provide all the necessary functionality?
Worth pointing out here that the zarr filter-set is extensible (I suppose hdf5's is too, but I don't think this is ever done in practice), but I don't think it makes any particular claims to performance. I think both of the options above are reasonable, and there is no particular reason to exclude either: a zarr variable could look to xarray like floats but actually be stored as ints (i.e., arguments are passed to zarr), or it could look like ints which xarray expects to inflate to floats (i.e., stored as an attribute). I mean, if a user stores a float variable but includes kwargs to zarr for scale/filter (or any other filter arguments), we should make no attempt to interrupt that. The only question is: if the user wishes to apply scale/offset, which is their most likely intention? I would guess the latter (compute in xarray and use attributes), since xarray users probably don't know about zarr and its filters.
A further rather big advantage of zarr that I'm not aware of in cdf/hdf (I may be wrong) is not just null values, but not having a given block be written to disc at all if it only contains null data. This probably meshes perfectly well with most users' understanding of missing data / fill values.
FWIW, all filter (codec) classes have been migrated from zarr to a separate package called numcodecs and will be imported from there in the next (2.2) zarr release. Here is FixedScaleOffset. The implementation is basic numpy, probably some room for optimization.
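A small, hedged demo of that codec as exposed by numcodecs (the parameter values are mine):

```python
import numpy as np
from numcodecs import FixedScaleOffset

# store floats near 1000 as uint16 with two decimal digits of precision
codec = FixedScaleOffset(offset=1000, scale=100, dtype='f8', astype='u2')

x = np.linspace(1000, 1099, 50)
packed = codec.encode(x)         # round((x - offset) * scale) as uint16
unpacked = codec.decode(packed)  # back to float64, accurate to 1/scale
assert np.allclose(x, unpacked, atol=1 / 100)
```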
One path forward for now would be to ignore the filters like […]. If we think there is an advantage to using the zarr native filters, that could be added via a future PR once we have the basic backend working. @alimanfoo: when do you anticipate the 2.2 zarr release to happen? Will the API change significantly? If so, I will wait for that to move forward here.
The only advantage here would be for non-xarray users, who could use zarr to do this decoding/encoding automatically. For what it's worth, the implementation of scale/offset in xarray looks basically equivalent to what's done in zarr. I don't think there's a performance difference either way.
If you use chunks, I believe HDF5/NetCDF4 do the same thing. The original example is elided (it noted the same file size); a hedged sketch of the idea follows:
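```python
import os
import netCDF4

# declare a large chunked variable but write no data (hypothetical file)
with netCDF4.Dataset('lazy.nc', 'w') as nc:
    nc.createDimension('x', 10000)
    nc.createDimension('y', 10000)
    nc.createVariable('z', 'f8', ('x', 'y'), chunksizes=(1000, 1000))

# ~800 MB if dense, but unwritten HDF5 chunks take no space on disk
print(os.path.getsize('lazy.nc'))  # just a few KB of metadata
```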
On Tuesday, August 29, 2017, Ryan Abernathey wrote:

> @alimanfoo: when do you anticipate the 2.2 zarr release to happen? Will the API change significantly? If so, I will wait for that to move forward here.

Zarr 2.2 will hopefully happen some time in the next 2 months, but it will be fully backwards-compatible, no breaking API changes.
@rabernat, is there anything I can do to help push this along?
I am stuck on figuring out how to develop a new test case for this. (It doesn't help that #1531 is messing up the backend tests.) If @shoyer can give us a few hints about how to best implement a test class (i.e. what to subclass, etc.), I think that could jumpstart testing and move the PR forward. I welcome contributions from others such as @martindurant on this. I won't have much time in the near future, since a new semester just dropped on me like a load of bricks.
@rabernat indeed, the backend tests are not terribly well organized right now. Probably the place to start is to inherit from the test classes in `xarray/tests/test_backends.py` (lines 1271 to 1279 at 98a05f1).
@shoyer, is martindurant@6c1fb6b a reasonable start?
@martindurant: I may have some time to get back to working on this next week. (Especially if @jhamman can help me sort out the backend testing.) What is the status of your branch?
@@ -184,7 +185,7 @@ def sync(self):
         import dask.array as da
         import dask
         if LooseVersion(dask.__version__) > LooseVersion('0.8.1'):
-            da.store(self.sources, self.targets, lock=GLOBAL_LOCK)
+            da.store(self.sources, self.targets, lock=self.lock)
Yes, this looks good.
I just gave this another spin. I think it's more than ready to be merged as an experimental feature. We should be able to quickly iterate on some of the smaller issues/features via follow-on PRs.
As a minor point to complement what Matthew and Alistair have already said, one can pretty easily […]
doc/io.rst
Outdated
~~~~~~~~~~~~~~~~~~~~~

It is possible to read and write xarray datasets directly from / to cloud
storage buckets using zarr. This example uses the `gcsfs`_ pacakge to provide
pacakge -> package
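For context, a minimal sketch of the cloud-storage pattern the docs excerpt above describes (the bucket path and project name are hypothetical; assumes gcsfs is installed and credentials are configured):

```python
import gcsfs
import xarray as xr

ds = xr.Dataset({'foo': (('x',), [1, 2, 3])})

fs = gcsfs.GCSFileSystem(project='my-project')
store = gcsfs.mapping.GCSMap('my-bucket/my-dataset.zarr', gcs=fs)

ds.to_zarr(store)          # write directly to the bucket
ds2 = xr.open_zarr(store)  # and read it back
```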
Thanks for the tremendous work @rabernat, looking forward to testing this! In the future it would be nice to briefly describe the advantages of zarr over netcdf for new users. A speed benchmark could help, too! This can be done once the backend has more maturity and when we refactor the I/O docs.
Will merge later today if no further comments.
woohoo, thank you Ryan!
Question: how would one build a zarr-xarray dataset? With zarr you can open an array that contains no data and use set-slice notation to fill in the values (which is what dask's store essentially does). If I have some pre-known coordinates and bigger-than-memory data arrays, how would I go about getting the values into the zarr structure? If this can't be done directly with the xarray interface, is there a way to call zarr's open/create/zeros such that the corresponding array will appear as a variable when the same dataset is opened with xarray?
Does the `to_zarr` method suffice? http://xarray.pydata.org/en/latest/generated/xarray.Dataset.to_zarr.html#xarray.Dataset.to_zarr
@martindurant - If I understand your question correctly, I think you should be able to follow a pretty standard xarray workflow:

```python
ds = xr.Dataset()
ds['your_varname'] = xr.DataArray(some_dask_array,
                                  dims=['dimname0', 'dimname1', ...],
                                  coords=dict_of_preknown_coords)
# repeat for each variable you want in your dataset
ds.to_zarr(some_zarr_store)

# then to open
ds2 = xr.open_zarr(some_zarr_store)
```

Two things to note: […]
@jhamman, that partially solves what I mean; I can probably turn my data into dask arrays with some difficulty, but really I was hoping for something like the following (the original snippet is elided; see the sketch below):
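(For reference, the zarr-side behavior being alluded to, in my runnable illustration:)

```python
import numpy as np
import zarr

# create an array with no data written yet (in-memory store)
z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='f8')
# set-slice notation fills in values; only touched chunks materialize
z[0:100, :] = np.random.rand(100, 1000)
```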
and expect to be able to set the values of the new variable in the same way that you can with the equivalent zarr array. I can probably get around this by setting the values with […].
@martindurant that could probably be addressed most cleanly by improving […].
See dask/dask#2000 for the dask issue. Once this works in dask, it should be quite easy to implement in xarray, too.
It might be enough, in this case, to provide some helper function in zarr to create and fetch arrays that will show up as variables in xarray - this need not be specific to being used via dask. I am assuming, with the work done in this PR, that there is an unambiguous way to determine if a zarr group can be interpreted as an xarray dataset, and that zarr then knows how to add things that look like variables (which generally, in the zarr case, don't involve writing any actual data until the parts of the array are filled in).
So Zarr supports storing structured arrays. Maybe that's what you are looking for, @martindurant? Would suggest using the latest 2.2.0 RC though, as it fixed a few issues in this regard (particularly with NumPy 1.14).
OK, so the way to do this in pure zarr appears to be to simply create the appropriate zarr array and set its dimensions attribute:
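(The original snippet is elided; a minimal reconstruction, assuming the `'_ARRAY_DIMENSIONS'` convention from this PR and a hypothetical store path:)

```python
import xarray as xr
import zarr

group = zarr.open_group('existing_dataset.zarr', mode='a')
z = group.zeros('new_var', shape=(100, 200), chunks=(50, 50), dtype='f8')
# the special attribute that makes xarray recognize it as a variable
z.attrs['_ARRAY_DIMENSIONS'] = ['time', 'x']

z[0:50, :] = 1.0  # fill values zarr-style, chunk by chunk
ds = xr.open_zarr('existing_dataset.zarr')  # 'new_var' now appears
```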
I'm enjoying this discussion. Zarr offers lots of new possibilities for appending / updating datasets that we should try to support. I personally would really like to be able to append / extend existing arrays from within xarray.
Yeah, ideally when adding a variable like […] we should be able to apply an optimization strategy in which the zarr array is created without filling in all those unnecessary zeros. This seems doable.

On the other hand, implementing […] (which cannot be done currently with dask arrays at all) in such a way that only partitions with data get updated - this seems really hard.
- passes `git diff upstream/master | flake8 --diff`
- `whats-new.rst` for all changes and `api.rst` for new API

I think that a zarr backend could be the ideal storage format for xarray datasets, overcoming many of the frustrations associated with netcdf and enabling optimal performance on cloud platforms.
This is a very basic start to implementing a zarr backend (as proposed in #1223); however, I am taking a somewhat different approach. I store the whole dataset in a single zarr group. I encode the extra metadata needed by xarray (so far just dimension information) as attributes within the zarr group and child arrays. I hide these special attributes from the user by wrapping the attribute dictionaries in a `HiddenKeyDict`, so that they can't be viewed or modified. I have no tests yet (:flushed:), but the following code works.
There is a very long way to go here, but I thought I would just get a PR started. Some questions that would help me move forward:

- […] `open_dataset`? Or do we need a new method? I think `.to_zarr()` would be quite useful.