zarr as persistent store for xarray #1223
This looks pretty cool to me. I expected it to be harder to encode xarray into zarr. Some thoughts/comments:
3: a JSON-like representation, such as the one used by the hidden .xarray item, would also do.
Also cc @alimanfoo
Happy to help if there's anything to do on the zarr side.
@martindurant thanks for posting this as an issue -- I didn't get a notification from your ping in the gist. I agree that serializing xarray objects to zarr should be pretty straightforward and seems quite useful.

To properly handle edge cases like strange data types (e.g., datetime64 or object), we could either directly write a DataStore, or write a separate "znetcdf" or "netzdf" module that implements an interface similar to h5netcdf (which itself is a thin wrapper on top of h5py). All things being equal, I would prefer the latter approach, because people seem to find these intermediate interfaces useful, and it would help clarify the specification of the file format versus the details of how xarray uses it.

As far as the spec goes, I agree that JSON is the sensible file format. Really, all we need on top of zarr is:
This could make sense either as part of zarr or as a separate library. I would lean towards putting it in zarr, only because that would be slightly more convenient: we could safely make use of subclassing to add the extra functionality, and zarr already handles hierarchies, arrays and metadata, which is most of the hard work.

I'm certainly quite open to integrating experimental data formats like this one into xarray, but ultimately of course it depends on interest from the community. This wouldn't even necessarily need to live in xarray proper (though that would be fine, too). For example, @rabernat wrote a DataStore for loading MITgcm outputs (https://github.com/xgcm/xmitgcm).
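To make the "znetcdf"-style idea above concrete, here is a rough sketch of what such a thin, h5netcdf-like layer might look like. The class and method names are hypothetical, and a plain dict stands in for a zarr group; this only illustrates the shape of the interface, not a real implementation.

```python
import json

class ZNetCDFGroup:
    """Hypothetical sketch of an h5netcdf-like interface on top of a
    zarr-style key/value store (a plain dict stands in for zarr here)."""

    def __init__(self, store=None):
        self.store = store if store is not None else {}

    def create_variable(self, name, dims, data, attrs=None):
        # Store the array data under the variable's key...
        self.store[name] = list(data)
        # ...and record dimension names and user attributes as a JSON
        # document, analogous to zarr's per-array .zattrs metadata.
        self.store[name + "/.zattrs"] = json.dumps(
            {"dimensions": list(dims), "attrs": attrs or {}}
        )

    def variable_dims(self, name):
        return json.loads(self.store[name + "/.zattrs"])["dimensions"]

g = ZNetCDFGroup()
g.create_variable("temperature", dims=("time", "lat"),
                  data=[1.5, 2.5], attrs={"units": "K"})
print(g.variable_dims("temperature"))  # ['time', 'lat']
```

The point of such a layer is that the on-disk convention (dimension names stored as metadata next to each array) is specified once, independently of how xarray consumes it.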
I have developed my example a little to sidestep the subclassing you suggest, which seemed tricky to implement; please see https://gist.github.com/martindurant/06a1e98c91f0033c4649a48a2f943390. I can use the zarr group structure to mirror at least typical use of xarray: variables, coordinates and sets of attributes on each. I have tested this with S3 too, stealing a little code from dask to show the idea.
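The group-mirroring idea can be sketched as follows. The function name and the `coords/`/`variables/` key layout are illustrative only (the gist's exact format may differ), and nested dicts stand in for zarr groups and arrays.

```python
import json

def dataset_to_store(ds, store):
    """Sketch: mirror an xarray-like dataset (here just nested dicts) into
    a flat zarr-style key/value store, using the group prefixes 'coords/'
    and 'variables/'. Illustrative layout, not the gist's exact format."""
    # Dataset-level attributes live at the root, like zarr's .zattrs.
    store[".zattrs"] = json.dumps(ds.get("attrs", {}))
    for group in ("coords", "variables"):
        for name, arr in ds.get(group, {}).items():
            store[f"{group}/{name}"] = arr

ds = {
    "attrs": {"title": "demo"},
    "coords": {"time": [0, 1, 2]},
    "variables": {"temp": [280.0, 281.5, 279.9]},
}
store = {}
dataset_to_store(ds, store)
print(sorted(store))  # ['.zattrs', 'coords/time', 'variables/temp']
```

Because every key in the resulting store is independent, the same layout maps directly onto a directory tree or an S3 bucket.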
Just to say this is looking neat. For storing an xarray.DataArray, do you think it would be possible to do away with pickling up all the metadata and storing it in the .xarray resource? Specifically, I'm wondering if this could all be stored as attributes on the zarr array, with some conventions for special xarray attribute names. I'm guessing there must be some conventions for storing all this metadata as attributes in an HDF5 (netCDF) file; it would potentially be nice to mirror that as much as possible.
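As a sketch of what such an attribute convention might look like: netCDF records, for each variable, the names of its dimensions, and the same idea could be mirrored as a reserved attribute on each zarr array. The attribute name `_dimensions` below is made up purely for illustration, not an established convention.

```python
import json

# Hypothetical convention: each array carries a reserved attribute naming
# its dimensions, mirroring how netCDF/HDF5 associate dimensions with data.
attrs = {
    "_dimensions": ["time", "lat", "lon"],  # reserved key (illustrative name)
    "units": "kelvin",                      # ordinary user attribute
}

# zarr persists attributes as JSON, so the convention only works if the
# whole mapping survives a JSON round trip.
restored = json.loads(json.dumps(attrs))
print(restored["_dimensions"])  # ['time', 'lat', 'lon']
```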
@alimanfoo, in the new dataset save function I do exactly as you suggest: everything gets put as a dict into the main zarr group's attributes, with special attribute names "attrs" for the dataset root, "coords" for the set of coordinate objects and "variables" for the set of variable objects (all of these have their own attributes in xarray).
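The root-attribute scheme described above might look roughly like this; the nested layout is a sketch, and the exact structure in the gist may differ.

```python
import json

# Sketch of the root-group metadata layout: one dict stored as the main
# group's attributes, with special keys for the dataset's own attrs and
# for the per-coordinate / per-variable attrs.
root_attrs = {
    "attrs": {"title": "ocean model run"},
    "coords": {"time": {"units": "days since 2000-01-01"}},
    "variables": {"salt": {"units": "psu", "dims": ["time"]}},
}

# zarr stores group attributes as JSON, so the whole dict must survive a
# JSON round trip unchanged.
assert json.loads(json.dumps(root_attrs)) == root_attrs
print(list(root_attrs))  # ['attrs', 'coords', 'variables']
```

Keeping all of this in a single attributes document means one metadata read per dataset, rather than one per array.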
Yep, that looks good. I was wondering about the xarray_to_zarr() function?
True, xarray_to_zarr is unchanged from before. The dataset functions could supersede it, since a single array is just a special case of a dataset; or we could decide that the special case is worth having short-cut functions for. I was worried about the number of metadata files being created, since on a remote system like S3 there is a large overhead to reading many small files.
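One hypothetical way to mitigate the many-small-files overhead (not part of the gist, just a sketch of the idea) would be to merge all the per-array metadata documents into a single JSON blob that a remote reader can fetch with one request. The key name `.all_metadata` is invented for illustration.

```python
import json

# A store with metadata scattered across several small keys, each of which
# would cost one S3 GET to read.
store = {
    ".zattrs": '{"title": "demo"}',
    "temp/.zattrs": '{"units": "K"}',
    "time/.zattrs": '{"units": "s"}',
}

# Hypothetical mitigation: copy every metadata document into one combined
# JSON blob under a single key (illustrative name), so a reader can fetch
# all metadata with one request and fall back to individual keys if absent.
consolidated = {key: json.loads(value) for key, value in store.items()}
store[".all_metadata"] = json.dumps(consolidated)

# A remote reader now needs one GET instead of three.
meta = json.loads(store[".all_metadata"])
print(meta["temp/.zattrs"]["units"])  # K
```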
@alimanfoo, do you think this work would make more sense as part of zarr rather than as part of xarray?
FWIW I think it would be better in xarray or a separate package, at least
at the moment, just because I don't have a lot of time right now for OSS
and need to keep Zarr as lean as possible.
netCDF and HDF are good legacy archival formats handled by xarray and the wider numerical Python ecosystem, but they don't play nicely with parallel access across a cluster or from an object store like S3.
zarr is certainly non-standard, but would make a very nice internal store for intermediates.
The gist below is a simple demonstration that we could use zarr not only for dask but for xarray too, without too much expenditure of effort:
https://gist.github.com/martindurant/dc27a072da47fab8d63117488f1fd7f1
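Part of what makes zarr friendly to cluster and object-store access is that each chunk of an array lives under its own store key, so workers can fetch or write chunks independently. A rough sketch of the key arithmetic (illustrative only, not zarr's actual implementation):

```python
import itertools

def chunk_keys(name, shape, chunks):
    """Sketch of zarr-style chunk addressing: each chunk of an array
    becomes an independent key like 'name/0.1', so the corresponding S3
    objects (or files) can be fetched in parallel. Not zarr's actual code."""
    # Number of chunks along each axis, via ceiling division.
    counts = [-(-s // c) for s, c in zip(shape, chunks)]
    for idx in itertools.product(*(range(n) for n in counts)):
        yield name + "/" + ".".join(map(str, idx))

keys = list(chunk_keys("temp", shape=(4, 6), chunks=(2, 3)))
print(keys)  # ['temp/0.0', 'temp/0.1', 'temp/1.0', 'temp/1.1']
```

Since each key maps to one object, a dask cluster can assign one chunk per task with no coordination beyond knowing the shape and chunk sizes.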