
zarr as persistent store for xarray #1223

Closed
martindurant opened this issue Jan 20, 2017 · 12 comments · Fixed by #1528

Comments

@martindurant
Contributor

netCDF and HDF are good legacy archival formats handled by xarray and the wider numerical python ecosystem, but they don't play nicely with parallel access across a cluster or from an archive store like s3.
zarr is certainly non-standard, but would make a very nice internal store for intermediates.

The gist below is a simple motivator showing that we could use zarr not only for dask but for xarray too, without too much effort.
https://gist.github.com/martindurant/dc27a072da47fab8d63117488f1fd7f1

@mrocklin
Contributor

This looks pretty cool to me. I expected it to be harder to encode xarray into zarr. Some thoughts/comments:

  1. Is it harder to encode a full xarray into zarr? Are there cases that are not covered by this example that are likely to occur in the wild (mostly a question for @shoyer)
  2. I guess one major thing missing is storing full Dataset objects rather than just DataArrays. I suspect that scientific users want to keep all of the variables and coordinates in a single artifact
  3. It would be nice to avoid using pickle if possible, so that the data could be cross-language.
  4. How open is the XArray community to adding experimental to/from_zarr methods?
  5. Eventually we probably want to do lazy_value = da.store(..., compute=False) and then compute all of them at once

@pwolfram @rabernat @jhamman
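The batched-store idea in point 5 can be sketched as follows; plain numpy arrays stand in for zarr targets to keep the example self-contained, but any object supporting slice assignment works the same way:

```python
import numpy as np
import dask
import dask.array as da

# Two lazy dask arrays to be written out.
x = da.arange(10, chunks=5)
y = x * 2

# Targets; zarr arrays created with matching shape/dtype would work too.
tx = np.empty(10, dtype=x.dtype)
ty = np.empty(10, dtype=y.dtype)

# compute=False returns Delayed objects instead of writing immediately...
lazy = [da.store(src, tgt, compute=False) for src, tgt in ((x, tx), (y, ty))]
dask.compute(*lazy)  # ...so all writes execute in a single scheduler pass
```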

@martindurant
Contributor Author

Re 3: a JSON-like representation, such as the one used by the hidden .xarray item, would also do.

@mrocklin
Contributor

Also cc @alimanfoo

@alimanfoo
Contributor

alimanfoo commented Jan 21, 2017 via email

@shoyer
Member

shoyer commented Jan 21, 2017

@martindurant thanks for posting this as an issue -- I didn't get a notification from your ping in the gist.

I agree that serializing xarray objects to zarr should be pretty straightforward and seems quite useful.

To properly handle edge cases like strange data types (e.g., datetime64 or object) and Dataset objects, we probably want to integrate this with xarray's existing conventions handling and DataStore interface. This will be good motivation for me to finish up my refactor in #1087 -- right now the interface is a bit more complex than needed, and doesn't do a good job of abstracting details like whether file formats need locking.

So we could either directly write a DataStore or write a separate "znetcdf" or "netzdf" module that implements an interface similar to h5netcdf (which itself is a thin wrapper on top of h5py). All things being equal, I would prefer the latter approach, because people seem to find these intermediate interfaces useful, and it would help clarify the specification of the file format vs. details of how xarray uses it.

As far as the spec goes, I agree that JSON is the sensible file format. Really, all we need on top of zarr is:

  • specified dimension sizes, stored at the group level (Dict[str, int])
  • a list of dimension names associated with each array (List[str])
  • a small amount of validation logic to ensure that dimensions used on an array exist (on the array's group or one of its parents) and are consistent
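A sketch of what that metadata and validation could look like, using plain JSON-serializable dicts (the key names here are illustrative, not part of any agreed format):

```python
# Dimension sizes live on the group; each array lists its dimension names.
group_meta = {"dimensions": {"time": 4, "lat": 5}}   # Dict[str, int]
array_dims = {"temperature": ["time", "lat"]}        # List[str] per array

def validate(group_meta, array_dims):
    """Ensure every dimension an array uses is defined on its group."""
    sizes = group_meta["dimensions"]
    for name, dims in array_dims.items():
        missing = [d for d in dims if d not in sizes]
        if missing:
            raise ValueError(f"array {name!r} uses undefined dimensions {missing}")

validate(group_meta, array_dims)  # passes for the consistent example above
```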

This could make sense either as part of zarr or a separate library. I would lean towards putting it in zarr only because that would be slightly more convenient, as we could safely make use of subclassing to add the extra functionality. zarr already handles hierarchies, arrays and metadata, which is most of the hard work.

I'm certainly quite open to integrating experimental data formats like this one into xarray, but ultimately of course it depends on interest from the community. This wouldn't even necessarily need to live in xarray proper (though that would be fine, too). For example, @rabernat wrote a DataStore for loading MIT GCM outputs (https://github.com/xgcm/xmitgcm).

@martindurant
Contributor Author

I have developed my example a little to sidestep the subclassing you suggest, which seemed tricky to implement.

Please see https://gist.github.com/martindurant/06a1e98c91f0033c4649a48a2f943390
(dataset_to/from_zarr functions)

I can use the zarr groups structure to mirror at least typical use of xarrays: variables, coordinates and sets of attributes on each. I have tested this with s3 too, stealing a little code from dask to show the idea.

@alimanfoo
Contributor

alimanfoo commented Feb 21, 2017 via email

@martindurant
Contributor Author

martindurant commented Feb 22, 2017

@alimanfoo, in the new dataset save function I do exactly as you suggest: everything gets put as a dict into the main zarr group attributes, with special attribute names "attrs" for the dataset root, "coords" for the set of coordinate objects, and "variables" for the set of variable objects (all of which have their own attributes in xarray).
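Roughly, the root group's attributes would be shaped like this (a hypothetical example, not the exact output of the gist); since everything is JSON-serializable, no pickle is needed:

```python
import json

root_attrs = {
    "attrs": {"title": "example dataset"},                  # dataset root
    "coords": {"time": {"units": "days since 2000-01-01"}}, # per-coordinate
    "variables": {"temperature": {"units": "K"}},           # per-variable
}

# Round-trips cleanly through JSON, so it stays cross-language.
assert json.loads(json.dumps(root_attrs)) == root_attrs
```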

@alimanfoo
Contributor

alimanfoo commented Feb 22, 2017 via email

@martindurant
Contributor Author

True, xarray_to_zarr is unchanged from before. The dataset functions could supersede it, since a single xarray is just a special case of a dataset; or we could decide that the special case is worth having short-cut functions for. I was worried about the number of metadata files being created, since on a remote system like S3 there is a large overhead to reading many small files.

@martindurant
Contributor Author

@alimanfoo , do you think this work would make more sense as part of zarr rather than as part of xarray?

@alimanfoo
Contributor

alimanfoo commented Feb 23, 2017 via email
