open_mfdatatree #51

TomNicholas · 2021-12-16T22:40:11Z

Currently we have an open_datatree function which opens a single netCDF file (or Zarr store). We could imagine an open_mfdatatree function, analogous to open_mfdataset, which can open multiple files at once.

As DataTree has a structure essentially the same as that of a filesystem, I'm imagining a use case where the user has a bunch of data files stored in nested directories, e.g.

project
    /experimental
        data.nc
    /simulation
        /highres
            output.nc
        /lowres
            output.nc

We could look through all of these folders recursively, open any files found of the correct format, and store them in a single tree.

We could even allow for multiple data files in each folder if we called open_mfdataset on all the files found in each folder.
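A minimal sketch of the directory walk described above, not an existing xarray or datatree API. The helper names group_files_by_folder and open_folder_tree are hypothetical, as is the choice of one tree node per folder. The second function assumes a recent xarray that provides xr.DataTree, with dask installed for open_mfdataset.

```python
from pathlib import Path


def group_files_by_folder(root, pattern="*.nc"):
    """Map each tree path (e.g. "/simulation/highres") to the data files in that folder."""
    root = Path(root)
    groups = {}
    for file in sorted(root.rglob(pattern)):
        if not file.is_file():
            continue
        rel = file.parent.relative_to(root)
        # Files directly under root belong to the root node "/".
        node = "/" + rel.as_posix() if rel.parts else "/"
        groups.setdefault(node, []).append(file)
    return groups


def open_folder_tree(root, pattern="*.nc", **kwargs):
    """Hypothetical stand-in for open_mfdatatree: one node per folder."""
    import xarray as xr  # imported lazily so the path logic works without xarray

    groups = group_files_by_folder(root, pattern)
    # Assumption: all files in one folder can be combined by open_mfdataset.
    datasets = {
        node: xr.open_mfdataset([str(f) for f in files], **kwargs)
        for node, files in groups.items()
    }
    return xr.DataTree.from_dict(datasets)
```

For the project layout above, group_files_by_folder("project") would yield the keys /experimental, /simulation/highres and /simulation/lowres, each with one file.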

EDIT: We could also save a tree out to multiple folders like this using a save_mfdatatree method.

This might be particularly useful for users who want the benefit of a tree-like structure but are using a file format that doesn't support groups.


castelao · 2022-09-01T01:39:32Z

In the case of save_mfdatatree, where would it save the global and group level attributes? I see two paths:

Each file preserves the upper levels. For instance, in your example, data.nc would still contain groups such as /experimental/data but omit the other branches, while preserving the global attributes as well as the attributes of experimental.
Since attributes are relevant for all levels underneath them, the global attributes from the project would be carried to experimental and combined, giving precedence to experimental on duplicates, and then carried to data, giving precedence to data's attributes on duplicates. For instance, data.nc would inherit the attribute Conventions from the top-level project. That way data.nc would be complete and self-contained, without losing relevant information.
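The second option can be sketched with plain dictionaries. This is only an illustration of the precedence rule, not DataTree behaviour; the function name inherited_attrs and the attribute values are made up for the example.

```python
def inherited_attrs(attrs_by_path, path):
    """Merge attributes from the root down to `path`; deeper nodes win on duplicate keys."""
    merged = dict(attrs_by_path.get("/", {}))
    current = ""
    for part in path.strip("/").split("/"):
        if not part:
            continue
        current = f"{current}/{part}"
        # Later (deeper) updates overwrite values inherited from above.
        merged.update(attrs_by_path.get(current, {}))
    return merged


# Illustrative attribute values only.
attrs = {
    "/": {"Conventions": "CF-1.8", "title": "project"},
    "/experimental": {"title": "experimental"},
    "/experimental/data": {"comment": "raw"},
}
data_attrs = inherited_attrs(attrs, "/experimental/data")
```

Here data_attrs keeps Conventions from the root, while title comes from experimental because the deeper level takes precedence.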

Extending a little on the second option, it would be nice to be able to extract any level of the tree without losing information. It could be a layer before actually exporting to netCDF. If I have a DataTree object project like in your example and I'm only interested in the high-resolution output, is there already functionality for something like project.flatten("/simulation/highres") that preserves the upper levels, attributes, and variables?

TomNicholas · 2022-09-01T18:08:35Z

In the case of save_mfdatatree, where would it save the global and group level attributes?

Each file preserves the upper levels.

I think this is what I was imagining. That's the most direct and simple mapping between an in-memory datatree and a set of folders and .nc files.
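One way to picture that mapping, as a sketch only: the helper node_to_file is hypothetical, and it assumes that the last component of a node path becomes the file name while the components above it become folders.

```python
from pathlib import Path


def node_to_file(root, node_path, suffix=".nc"):
    """Return the file a node would be written to under `root`."""
    parts = [p for p in node_path.split("/") if p]
    if not parts:
        raise ValueError("the root node has no file name of its own")
    # Upper levels become directories; the node name becomes the file name.
    return Path(root).joinpath(*parts[:-1]) / (parts[-1] + suffix)


target = node_to_file("project", "/simulation/highres/output")
```

With the example tree, target is project/simulation/highres/output.nc. Under the "each file preserves the upper levels" option, the dataset inside that file would still sit in the group /simulation/highres/output.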

the global attributes from the project would be carried to experimental ...

I'm hesitant to do anything that introduces "inheritance" from nodes above like this. The problem is that different group-supporting formats have different hierarchical behaviours, so something that follows netCDF might be weird with Zarr. Ultimately the in-memory DataTree should only work one way, so a choice has to be made there (and so far I've gone for the simplest choice: independence between nodes). That said, you could imagine having a kwarg to save_mfdatatree that changes the behaviour like this when saving.

If I have a DataTree object project like in your example and I'm only interested in the high-resolution output, is there already functionality for something like project.flatten("/simulation/highres") that preserves the upper levels, attributes, and variables?

There is no specific method for flattening parts of the tree, but we can make one! (xref #79) I'm not quite sure what you want it to do though - what type would you want project.flatten("/simulation/highres") to return?

castelao · 2022-09-05T19:06:30Z

Preserving the structure sounds wise. I have two suggestions on that:

Keep track of where it came from. Maybe use the global attribute source in data.nc and output.nc, possibly pointing to the id attribute of the original project. id should be unique if following ACDD-1.3.
Some relevant information might be left on higher levels. In your example, let's assume that depth is common to highres and lowres, so it was stored at the simulation group level. In that case highres/output.nc and lowres/output.nc are incomplete, not self-contained. One option to avoid redundancy in higher-level variables would be to use the External Variables attribute.

On the flattening, I was thinking of something like project.subset(["/project/hires/temperature", "/project/hires/doxy"]).squeeze(). This would extract a subset of variables together with all other variables and attributes from the upper levels. Then, in some cases it might make sense to flatten "unnecessary layers". For instance, if the outcome is time, lat, lon, and sea surface height, I might just use a flat dataset. I envision cases where it makes sense to distribute a large, consistent, and complete dataset, such as a full simulation and its products, or all variables measured by different sensors aboard the same satellite, or all products from a single glider mission. But from the user's perspective, it is common for someone to be interested in a single branch of that hierarchical tree, and in that case, information spread over multiple levels adds unnecessary complexity.

I have no strong opinion on any of those, but just ideas.

dcherian · 2023-02-07T17:03:38Z

There is no specific method for flattening parts of the tree

Just found this: https://gitlab.eumetsat.int/open-source/netcdf-flattener/

castelao · 2023-02-08T04:21:03Z

@dcherian, thanks for pointing that out! @erget is a major contributor to the CF-Conventions and a great person to work with. Maybe there is a common interest here.

Evidlo · 2023-10-11T05:24:27Z

Our team is interested in open_mfdatatree and save_mfdatatree as well, but for the purpose of avoiding large files. Our total dataset size is hundreds of GB and so it would be nice to have a set of smaller netCDF files for each week of data.

keewis · 2024-08-13T16:51:12Z

closing in favor of pydata/xarray#9351

TomNicholas added the design question label Dec 16, 2021

TomNicholas mentioned this issue Dec 21, 2021

Having a custom engine for open_mfdatatree #55

Closed

TomNicholas mentioned this issue Jan 19, 2022

[FEATURE]: Read from/write to several NetCDF4 groups with a single file open/close operation pydata/xarray#6174

Open

TomNicholas added the IO Representation of particular file formats as trees label May 18, 2022

djhoese mentioned this issue Jan 13, 2023

[Design] Integrate xarray-datatree pytroll/satpy#2352

Open

TomNicholas mentioned this issue Jul 10, 2023

Move absolute path finder from open_mfdataset to own function pydata/xarray#7968

Merged

keewis mentioned this issue Aug 13, 2024

Add open_mfdatatree pydata/xarray#9351

Open

keewis closed this as completed Aug 13, 2024
