This repository has been archived by the owner on Oct 24, 2024. It is now read-only.

open_mfdatatree #51

Closed
TomNicholas opened this issue Dec 16, 2021 · 7 comments
Labels
design question IO Representation of particular file formats as trees

Comments

@TomNicholas
Member

TomNicholas commented Dec 16, 2021

Currently we have an open_datatree function which opens a single netCDF file (or Zarr store). We could imagine an open_mfdatatree function, analogous to open_mfdataset, that opens multiple files at once.

As DataTree has a structure essentially the same as that of a filesystem, I'm imagining a use case where the user has a bunch of data files stored in nested directories, e.g.

project
    /experimental
        data.nc
    /simulation
        /highres
            output.nc
        /lowres
            output.nc

We could look through all of these folders recursively, open any files found of the correct format, and store them in a single tree.

We could even allow for multiple data files in each folder if we called open_mfdataset on all the files found in each folder.
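
A minimal sketch of the recursive walk described above, using only the standard library. The name `collect_tree_files` and its `suffix` parameter are hypothetical and not part of datatree or xarray. It only maps each folder (as a tree-style path) to the data files found in it; actually opening the files is left as a commented assumption.

```python
from pathlib import Path


def collect_tree_files(root, suffix=".nc"):
    """Map each folder under `root` to the sorted data files it contains.

    Hypothetical helper: keys are tree-style paths relative to `root`
    ("/" for the root folder itself), values are lists of file paths.
    """
    root = Path(root)
    groups = {}
    for path in sorted(root.rglob(f"*{suffix}")):
        if not path.is_file():
            continue
        rel_dir = path.parent.relative_to(root)
        # A file directly inside `root` has rel_dir == Path("."),
        # which we treat as the root of the tree.
        key = "/" if rel_dir == Path(".") else "/" + rel_dir.as_posix()
        groups.setdefault(key, []).append(path)
    return groups


# Assumed next step (not run here, requires xarray): open every folder's
# files as one dataset and assemble the results into a tree, e.g.
#   datasets = {key: xr.open_mfdataset(files) for key, files in groups.items()}
#   tree = DataTree.from_dict(datasets)
```

For the example layout, the keys would be "/experimental", "/simulation/highres" and "/simulation/lowres", each holding one file.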

EDIT: We could also save a tree out to multiple folders like this using a save_mfdatatree method.

This might be particularly useful for users who want the benefit of a tree-like structure but are using a file format that doesn't support groups.

@castelao

castelao commented Sep 1, 2022

In the case of save_mfdatatree, where would it save the global and group level attributes? I see two paths:

  • Each file preserves the upper levels. For instance, in your example, data.nc would still use groups inside it such as /experimental/data, omitting the other branches while preserving the global attributes as well as the attributes for experimental.
  • Since attributes are relevant for all levels underneath them, the global attributes from the project would be carried to experimental and combined, giving precedence to experimental where duplicated, and then carried to data, giving precedence to the data attributes where duplicated. For instance, data.nc would inherit the attribute Conventions from the top-level project. That way data.nc would be complete and self-contained, without losing relevant information.
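
To make the second option concrete, here is a small sketch of the precedence rule using plain dictionaries. The function name `inherit_attrs` and all attribute values are made up for illustration; this is not existing datatree behaviour.

```python
def inherit_attrs(*levels):
    """Merge attribute dicts from the top level down.

    Hypothetical helper: later (deeper) levels take precedence over
    earlier (higher) ones when the same attribute appears twice.
    """
    merged = {}
    for attrs in levels:
        merged.update(attrs)
    return merged


# Illustrative values only.
project = {"Conventions": "CF-1.8", "title": "project"}
experimental = {"title": "experimental"}
data = {"comment": "raw measurements"}

flat = inherit_attrs(project, experimental, data)
# `flat` keeps Conventions from project, takes title from experimental,
# and adds comment from data; the input dicts are left unchanged.
```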

Extending a little on the second option, it would be nice to be able to extract any level in the tree without losing information. It could be a layer before actually exporting to netCDF. If I have a DataTree object project like in your example and I'm only interested in the high-resolution output, is there already functionality for something like project.flatten("/simulation/highres") that preserves the upper levels, attributes, and variables?

@TomNicholas
Member Author

In the case of save_mfdatatree, where would it save the global and group level attributes?

Each file preserves the upper levels.

I think this is what I was imagining. That's the most direct and simple mapping between an in-memory datatree and a set of folders and .nc files.
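
As a rough illustration of that mapping, the sketch below turns node paths into target file paths. `plan_output_files` is a hypothetical helper rather than a proposed API, and it assumes every node path names a non-root node that holds data.

```python
from pathlib import PurePosixPath


def plan_output_files(node_paths, root, suffix=".nc"):
    """Map each tree node path to the file it would be written to.

    Hypothetical helper: assumes each node path is a non-root node.
    """
    root = PurePosixPath(root)
    plan = {}
    for node_path in node_paths:
        # Strip the surrounding "/" so the node path nests under `root`.
        relative = node_path.strip("/")
        plan[node_path] = root / (relative + suffix)
    return plan


nodes = [
    "/experimental/data",
    "/simulation/highres/output",
    "/simulation/lowres/output",
]
for node, target in plan_output_files(nodes, "project").items():
    print(node, "->", target)
```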

the global attributes from the project would be carried to experimental ...

I'm hesitant to do anything that introduces "inheritance" from nodes above like this. The problem is that different group-supporting formats have different hierarchical behaviours, so something that follows netCDF might be weird with Zarr. Ultimately the in-memory DataTree should only work one way, so a choice has to be made there (and so far I've gone for the simplest choice: independence between nodes). That said, you could imagine having a kwarg to save_mfdatatree that changes the behaviour like this when saving.

If I have a DataTree object project like in your example and I'm only interested in the high-resolution output, is there already functionality for something like project.flatten("/simulation/highres") that preserves the upper levels, attributes, and variables?

There is no specific method for flattening parts of the tree, but we can make one! (xref #79) I'm not quite sure what you want it to do though: what type would you want project.flatten("/simulation/highres") to return?

@castelao

castelao commented Sep 5, 2022

Preserving the structure sounds wise. I have two suggestions on that:

  • Keep track of where each file came from. Maybe use the global attribute source in data.nc and output.nc, possibly pointing to the id attribute of the original project. id should be unique if following ACDD-1.3.
  • Some relevant information might be left on higher levels. In your example, let's assume that depth is common between highres and lowres, so it was stored at the simulation group level. In that case highres/output.nc and lowres/output.nc are incomplete and not self-contained. One option to avoid redundancy in higher-level variables would be to use the External Variables attribute.

On the flattening, I was thinking of something like project.subset(["/project/hires/temperature", "/project/hires/doxy"]).squeeze(). This would extract a subset of variables together with all other variables and attributes from the upper levels. In some cases it might then make sense to flatten "unnecessary layers". For instance, if the outcome is time, lat, lon, and sea surface height, I might just use a flat dataset. I envision cases where it makes sense to distribute a large, consistent, and complete dataset, such as a full simulation and its products, all variables measured by different sensors aboard the same satellite, or all products from a single glider mission. But from the user's perspective, it is common to be interested in a single branch of that hierarchical tree, and in that case information spread over multiple levels adds unnecessary complexity.

I have no strong opinion on any of these; they are just ideas.

@dcherian

dcherian commented Feb 7, 2023

There is no specific method for flattening parts of the tree

Just found this: https://gitlab.eumetsat.int/open-source/netcdf-flattener/

@castelao

castelao commented Feb 8, 2023

@dcherian, thanks for pointing that out! @erget is a major contributor to the CF-Conventions and a great person to work with. Maybe there is a common interest here.

@Evidlo

Evidlo commented Oct 11, 2023

Our team is interested in open_mfdatatree and save_mfdatatree as well, but for the purpose of avoiding large files. Our total dataset size is hundreds of GB and so it would be nice to have a set of smaller netCDF files for each week of data.

@keewis
Contributor

keewis commented Aug 13, 2024

closing in favor of pydata/xarray#9351

@keewis keewis closed this as completed Aug 13, 2024