Add option to choose mfdataset attributes source. #3498

Merged
merged 15 commits into from Jan 11, 2020
Conversation

@juseg (Contributor) commented Nov 8, 2019

Add a master_file keyword argument to open_mfdataset to choose the source of global attributes in a multi-file dataset.

@juseg (Contributor, Author) commented Nov 11, 2019

I think I'm done. Can someone look at it? The master_file keyword argument is borrowed from netCDF4 (see #2382 and Unidata/netcdf4-python#835), although xarray's mechanism is independent.

The default is 0, which is consistent with current xarray behaviour. When concatenating model output in time it would be more logical to use -1, i.e. to preserve the history of the last file.
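The intended semantics can be sketched in plain Python. This is a hypothetical helper, not the actual xarray code: it just picks one input's attribute dict by position, the way the proposed keyword would.

```python
def combine_attrs(attrs_list, master=0):
    """Pick global attributes from the input at position `master`.

    Hypothetical sketch of the proposed behaviour: `master=0` matches
    current xarray behaviour (attributes of the first file), while
    `master=-1` keeps e.g. the history of the last file.
    """
    return dict(attrs_list[master])

attrs = [{"history": "created 1990"}, {"history": "appended 2000"}]
combine_attrs(attrs)      # {'history': 'created 1990'}  (current default)
combine_attrs(attrs, -1)  # {'history': 'appended 2000'}
```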

@@ -825,6 +826,10 @@ def open_mfdataset(
- 'override': if indexes are of same size, rewrite indexes to be
those of the first object with that dimension. Indexes for the same
dimension must have the same size in all objects.
master_file : int or str, optional
Contributor:
This is netCDF4's documentation for master_file:

file to use as "master file", defining all the variables with an aggregation dimension and all global attributes.

Let's make it clear that, unlike netCDF4, we are only using this for attributes.

Contributor (Author):

Do you suggest using a different keyword, maybe attrs_file?
Or just clarifying the difference in the docs? I don't mind.
@dcherian Thanks for the review!

Contributor:

I was initially thinking of just adding a line to the docstring but we should think about renaming this to something like attrs_from?

Contributor (Author):

So I've renamed it to attrs_file to avoid confusion with netCDF4. Thanks for pointing that out. I am open to any name as long as the option is there.

Contributor (Author):

@dcherian can we mark this as resolved? attrs_file now only accepts a file name (see the other conversation below).

@dcherian (Contributor) commented:
Thanks @juseg . I've left a few comments.

I see that this is your first PR. Welcome to xarray, and thanks for contributing 👏

juseg and others added 2 commits November 13, 2019 14:08
Co-Authored-By: Deepak Cherian <dcherian@users.noreply.github.com>
Unlike netCDF4's master_file this is only used for attributes.
@juseg (Contributor, Author) commented Nov 14, 2019

This will add a new kwarg to open_mfdataset. The current name is attrs_file, defaulting to 0 (current behaviour). Suggestions welcome.

@TomNicholas (Member) commented Dec 13, 2019

Thanks for this @juseg. The only problem I see is that a scalar number to specify the file only makes sense if the input is a 1D list, but open_mfdataset can also accept a nested list-of-lists (with combine='nested'), or ignore the order of the input entirely (with combine='by_coords'). What happens if you pass a list-of-lists of datasets?

On the other hand specifying the particular filepath or object makes sense in all cases, so perhaps the easiest way to avoid ambiguity would be to restrict to that option? (The default would just be left as-is.)

@juseg (Contributor, Author) commented Dec 13, 2019

@TomNicholas Thanks for bringing the discussion back to life! I'm not sure what happens in those cases, but I'm confident the default behaviour is unchanged, i.e. the attributes file is 0, whatever that 0 means (see my first commit).

If this is an issue I would suggest discussing it in a separate thread, as I think it is independent of my changes. On the other hand, I am eager to keep the file-number option because (1) attrs_file=-1 is the behaviour I need (to ensure that history is always preserved) and (2) attrs_file=0 is the current behaviour (again, whatever that means for nested lists).

@TomNicholas (Member) commented:
I'm not sure we should merge changes if we're unsure how they will behave in certain circumstances.

On the other hand I am eager to keep the file number option because (1) attrs_file=-1 is the behaviour that I need

If we kept just the string specifier, you could still solve the problem of preserving the history:

files_to_open = ['filepath1', 'filepath2']

ds = open_mfdataset(files_to_open, attrs_file=files_to_open[-1])

But then the option would always have clear and well-defined behaviour, even in more complex cases like combine='by_coords' or combine='nested' with a >1D input file list.

@juseg (Contributor, Author) commented Dec 14, 2019

@TomNicholas I've had a closer look at the code. Nested lists of file paths are processed by:

combined_ids_paths = _infer_concat_order_from_positions(paths)
ids, paths = (list(combined_ids_paths.keys()), list(combined_ids_paths.values()))

Using the method defined in:

def _infer_concat_order_from_positions(datasets):
    combined_ids = dict(_infer_tile_ids_from_nested_list(datasets, ()))
    return combined_ids


def _infer_tile_ids_from_nested_list(entry, current_pos):
    """
    Given a list of lists (of lists...) of objects, returns an iterator
    which returns a tuple containing the index of each object in the nested
    list structure as the key, and the object. This can then be called by the
    dict constructor to create a dictionary of the objects organised by their
    position in the original nested list.

    Recursively traverses the given structure, while keeping track of the
    current position. Should work for any type of object which isn't a list.

    Parameters
    ----------
    entry : list[list[obj, obj, ...], ...]
        List of lists of arbitrary depth, containing objects in the order
        they are to be concatenated.

    Returns
    -------
    combined_tile_ids : dict[tuple(int, ...), obj]
    """
    if isinstance(entry, list):
        for i, item in enumerate(entry):
            yield from _infer_tile_ids_from_nested_list(item, current_pos + (i,))
    else:
        yield current_pos, entry

In Python 3.7+ the list of paths is essentially a flattened list, e.g.:

>>> import xarray as xr
>>> paths = [list('abc'), list('def')]
>>> list(xr.core.combine._infer_concat_order_from_positions(paths).values())
['a', 'b', 'c', 'd', 'e', 'f']

Unfortunately the current code uses a dictionary, which means that in Python 3.6 and earlier the order is not guaranteed to be preserved. This also implies that the current default 0 is not well defined in the case of nested lists.

datasets = [open_(p, **open_kwargs) for p in paths]

combined.attrs = datasets[0].attrs

On the other hand, the ids are essentially ND indexes that could perhaps be used...

>>> list(xr.core.combine._infer_concat_order_from_positions(paths).keys())
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]

Or should we just stick to file paths as you suggest? And leave the default as is (i.e. ambiguous for Python 3.6 and earlier)?
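For illustration, the helper can be reimplemented as a standalone function (same logic as xarray's internal _infer_tile_ids_from_nested_list, rewritten here outside xarray) to reproduce both the flattened values and the N-D position keys:

```python
def infer_tile_ids(entry, current_pos=()):
    # Standalone reimplementation of _infer_tile_ids_from_nested_list,
    # for illustration only: recursively walk a nested list, yielding
    # (position, item) pairs in concatenation order.
    if isinstance(entry, list):
        for i, item in enumerate(entry):
            yield from infer_tile_ids(item, current_pos + (i,))
    else:
        yield current_pos, entry

paths = [list("abc"), list("def")]
combined = dict(infer_tile_ids(paths))
list(combined.values())  # ['a', 'b', 'c', 'd', 'e', 'f']
list(combined.keys())    # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```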

@TomNicholas (Member) commented:
Thanks @juseg .

in Python 3.6- the order is not guaranteed preserved.

I think for Python 3.6 and above the order is preserved, isn't it?

the current default 0 is not well defined in case of nested lists.

Yes, this is what I was thinking of.

the ids are essentially ND indexes that could perhaps be used...

We could do this, and that's how we would solve it in general, but I don't really think it's worth the effort/complexity.

Or should we just stick to file paths as you suggest? And leave the default as is

I think so - if we do this then users can still easily pick the attributes from the file of their choosing (solving the original issue), and if someone wants to be able to choose the attrs_file in another way later then we can worry about that then.
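A plain-Python sketch of why the path specifier stays well defined even for nested input (hypothetical helper and file names; the real lookup lives inside open_mfdataset):

```python
def select_attrs_source(paths, attrs_file=None):
    """Sketch of the path-based lookup discussed above (hypothetical helper).

    With attrs_file=None the first input wins, matching the existing
    default; otherwise the named file is used, wherever it sits in the
    nested list, so nesting depth does not matter.
    """
    flat = []

    def _flatten(entry):
        # Depth-first walk of the (possibly nested) input list.
        if isinstance(entry, list):
            for item in entry:
                _flatten(item)
        else:
            flat.append(entry)

    _flatten(paths)
    if attrs_file is None:
        return flat[0]
    if attrs_file not in flat:
        raise ValueError(f"{attrs_file!r} is not one of the input paths")
    return attrs_file

select_attrs_source([["a.nc", "b.nc"], ["c.nc", "d.nc"]], attrs_file="d.nc")  # 'd.nc'
select_attrs_source([["a.nc", "b.nc"], ["c.nc", "d.nc"]])                     # 'a.nc'
```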

Index behaviour is ambiguous for nested lists on older Python versions.
The default remains index 0, which is backward-compatible but also
ambiguous in this case (see docstring and pull request #3498).
@keewis (Collaborator) left a comment:
I left a few comments about passing Path objects, but other than that this looks good to me.

@TomNicholas (Member) commented:
I think there's nothing left to do here, thanks @juseg!

@TomNicholas reopened this Jan 11, 2020
@TomNicholas merged commit 099c090 into pydata:master Jan 11, 2020
@juseg deleted the master_file branch January 11, 2020 18:28
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 14, 2020
* upstream/master:
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  Support swap_dims to dimension names that are not existing variables (pydata#3636)
  Add map_blocks example to docs. (pydata#3667)
  add multiindex level name checking to .rename() (pydata#3658)
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 15, 2020
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 21, 2020
@TomNicholas added the topic-metadata label (Relating to the handling of metadata, i.e. attrs and encoding) Apr 5, 2020
Successfully merging this pull request may close these issues.

Add option to choose the source of global attributes in mfdataset.