Add option to choose mfdataset attributes source. #3498

Merged
merged 15 commits into from Jan 11, 2020
Conversation

@juseg (Contributor) commented Nov 8, 2019

Add a master_file keyword argument to open_mfdataset to choose the source of global attributes in a multi-file dataset.

@juseg (Contributor, Author) commented Nov 11, 2019

I think I'm done. Can someone look at it? The master_file keyword argument is borrowed from netCDF4 (see #2382 and Unidata/netcdf4-python#835), although xarray's mechanism is independent.

The default is 0, which is consistent with current xarray behaviour. When concatenating model output in time it would be more logical to use -1, i.e. to preserve the history of the last file.
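The intended semantics can be sketched in plain Python. This is a hypothetical helper, not the actual xarray code: it just picks one input's attribute dict by position, the way the proposed keyword would.

```python
def combine_attrs(attrs_list, master=0):
    """Pick global attributes from the input at position `master`.

    Hypothetical sketch of the proposed behaviour: `master=0` matches
    current xarray behaviour (attributes of the first file), while
    `master=-1` keeps e.g. the history of the last file.
    """
    return dict(attrs_list[master])

attrs = [{"history": "created 1990"}, {"history": "appended 2000"}]
combine_attrs(attrs)      # {'history': 'created 1990'}  (current default)
combine_attrs(attrs, -1)  # {'history': 'appended 2000'}
```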

@@ -825,6 +826,10 @@ def open_mfdataset(
- 'override': if indexes are of same size, rewrite indexes to be
those of the first object with that dimension. Indexes for the same
dimension must have the same size in all objects.
master_file : int or str, optional
Contributor:
This is netCDF4's documentation for master_file:

file to use as "master file", defining all the variables with an aggregation dimension and all global attributes.

Let's make it clear that, unlike netCDF4, we are only using this for attributes.

Contributor (Author):

Do you suggest using a different keyword, maybe attrs_file?
Or just clarifying the difference in the docs? I don't mind.
@dcherian Thanks for the review!

Contributor:

I was initially thinking of just adding a line to the docstring but we should think about renaming this to something like attrs_from?

Contributor (Author):

So I've renamed it to attrs_file to avoid confusion with netCDF4. Thanks for pointing that out. I am open to any name as long as the option is there.

Contributor (Author):

@dcherian can we mark this as resolved? attrs_file now only accepts a file name (see the other conversation below).

@dcherian (Contributor) commented:
Thanks @juseg . I've left a few comments.

I see that this is your first PR. Welcome to xarray, and thanks for contributing 👏

juseg and others added 2 commits November 13, 2019 14:08
Co-Authored-By: Deepak Cherian <dcherian@users.noreply.github.com>
Unlike netCDF4's master_file this is only used for attributes.
@juseg (Contributor, Author) commented Nov 14, 2019

This will add a new kwarg to open_mfdataset. The current name is attrs_file, defaulting to 0 (current behaviour). Suggestions welcome.

@TomNicholas (Member) commented Dec 13, 2019

Thanks for this @juseg. The only problem I see is that a scalar number to specify the file only makes sense if the input is a 1D list, but open_mfdataset can also accept a nested list-of-lists (with combine='nested'), or ignore the order of the input entirely (with combine='by_coords'). What happens if you pass a list-of-lists of datasets?

On the other hand specifying the particular filepath or object makes sense in all cases, so perhaps the easiest way to avoid ambiguity would be to restrict to that option? (The default would just be left as-is.)

@juseg (Contributor, Author) commented Dec 13, 2019

@TomNicholas Thanks for bringing the discussion back to life! I'm not sure what happens in those cases, but I'm confident the default behaviour is unchanged, i.e. the attributes file is 0, whatever that 0 means (see my first commit).

If this is an issue I would suggest discussing it in a separate thread, as I think it is independent of my changes. On the other hand, I am eager to keep the file-number option because (1) attrs_file=-1 is the behaviour I need (to ensure that history is always preserved) and (2) attrs_file=0 is the current behaviour (again, whatever that means for nested lists).

@TomNicholas (Member) commented:
I'm not sure we should merge changes if we're unsure how they will behave in certain circumstances.

On the other hand I am eager to keep the file number option because (1) attrs_file=-1 is the behaviour that I need

If we kept just the string specifier, you could still solve the problem of preserving the history:

files_to_open = ['filepath1', 'filepath2']

ds = open_mfdataset(files_to_open, attrs_file=files_to_open[-1])

But then the option would always have clear and well-defined behaviour, even in more complex cases like combine='by_coords' or combine='nested' with a >1D input file list.

@juseg (Contributor, Author) commented Dec 14, 2019

@TomNicholas I've had a closer look at the code. Nested lists of file paths are processed by:

combined_ids_paths = _infer_concat_order_from_positions(paths)
ids, paths = (list(combined_ids_paths.keys()), list(combined_ids_paths.values()))

Using the method defined in:

def _infer_concat_order_from_positions(datasets):
    combined_ids = dict(_infer_tile_ids_from_nested_list(datasets, ()))
    return combined_ids


def _infer_tile_ids_from_nested_list(entry, current_pos):
    """
    Given a list of lists (of lists...) of objects, returns an iterator
    which returns a tuple containing the index of each object in the nested
    list structure as the key, and the object. This can then be called by the
    dict constructor to create a dictionary of the objects organised by their
    position in the original nested list.

    Recursively traverses the given structure, while keeping track of the
    current position. Should work for any type of object which isn't a list.

    Parameters
    ----------
    entry : list[list[obj, obj, ...], ...]
        List of lists of arbitrary depth, containing objects in the order
        they are to be concatenated.

    Returns
    -------
    combined_tile_ids : dict[tuple(int, ...), obj]
    """
    if isinstance(entry, list):
        for i, item in enumerate(entry):
            yield from _infer_tile_ids_from_nested_list(item, current_pos + (i,))
    else:
        yield current_pos, entry

In Python 3.7+ the list of paths is essentially a flattened list, e.g.:

>>> import xarray as xr
>>> paths = [list('abc'), list('def')]
>>> list(xr.core.combine._infer_concat_order_from_positions(paths).values())
['a', 'b', 'c', 'd', 'e', 'f']

Unfortunately the current code uses a dictionary, which means that in Python 3.6 and earlier the order is not guaranteed to be preserved. This also implies that the current default 0 is not well defined in the case of nested lists.

datasets = [open_(p, **open_kwargs) for p in paths]

combined.attrs = datasets[0].attrs

On the other hand, the ids are essentially ND indexes that could perhaps be used...

>>> list(xr.core.combine._infer_concat_order_from_positions(paths).keys())
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]

Or should we just stick to file paths as you suggest? And leave the default as is (i.e. ambiguous for Python 3.6 and earlier)?
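For illustration, the helper can be reimplemented as a standalone function (same logic as xarray's internal _infer_tile_ids_from_nested_list, rewritten here outside xarray) to reproduce both the flattened values and the N-D position keys:

```python
def infer_tile_ids(entry, current_pos=()):
    # Standalone reimplementation of _infer_tile_ids_from_nested_list,
    # for illustration only: recursively walk a nested list, yielding
    # (position, item) pairs in concatenation order.
    if isinstance(entry, list):
        for i, item in enumerate(entry):
            yield from infer_tile_ids(item, current_pos + (i,))
    else:
        yield current_pos, entry

paths = [list("abc"), list("def")]
combined = dict(infer_tile_ids(paths))
list(combined.values())  # ['a', 'b', 'c', 'd', 'e', 'f']
list(combined.keys())    # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```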

@TomNicholas (Member) commented:
Thanks @juseg .

in Python 3.6- the order is not guaranteed preserved.

I think for Python 3.6 and above the order is preserved, isn't it?

the current default 0 is not well defined in case of nested lists.

Yes, this is what I was thinking of.

the ids are essentially ND indexes that could perhaps be used...

We could do this, and that's how we would solve it in general, but I don't really think it's worth the effort/complexity.

Or should we just stick to file paths as you suggest? And leave the default as is

I think so - if we do this then users can still easily pick the attributes from the file of their choosing (solving the original issue), and if someone wants to be able to choose the attrs_file in another way later then we can worry about that then.
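A plain-Python sketch of why the path specifier stays well defined even for nested input (hypothetical helper and file names; the real lookup lives inside open_mfdataset):

```python
def select_attrs_source(paths, attrs_file=None):
    """Sketch of the path-based lookup discussed above (hypothetical helper).

    With attrs_file=None the first input wins, matching the existing
    default; otherwise the named file is used, wherever it sits in the
    nested list, so nesting depth does not matter.
    """
    flat = []

    def _flatten(entry):
        # Depth-first walk of the (possibly nested) input list.
        if isinstance(entry, list):
            for item in entry:
                _flatten(item)
        else:
            flat.append(entry)

    _flatten(paths)
    if attrs_file is None:
        return flat[0]
    if attrs_file not in flat:
        raise ValueError(f"{attrs_file!r} is not one of the input paths")
    return attrs_file

select_attrs_source([["a.nc", "b.nc"], ["c.nc", "d.nc"]], attrs_file="d.nc")  # 'd.nc'
select_attrs_source([["a.nc", "b.nc"], ["c.nc", "d.nc"]])                     # 'a.nc'
```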

Index behaviour is ambiguous for nested lists on older Python versions.
The default remains index 0, which is backward-compatible but also
ambiguous in this case (see docstring and pull request #3498).
@keewis (Collaborator) left a comment:
I left a few comments about passing Path objects, but other than that this looks good to me.

@TomNicholas (Member) commented:
I think there's nothing left to do here, thanks @juseg!

@TomNicholas reopened this Jan 11, 2020
@TomNicholas merged commit 099c090 into pydata:master Jan 11, 2020
@juseg deleted the master_file branch January 11, 2020 18:28
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 14, 2020
* upstream/master:
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  Support swap_dims to dimension names that are not existing variables (pydata#3636)
  Add map_blocks example to docs. (pydata#3667)
  add multiindex level name checking to .rename() (pydata#3658)
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 15, 2020
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 21, 2020
@TomNicholas added the topic-metadata label (Relating to the handling of metadata, i.e. attrs and encoding) Apr 5, 2020
Successfully merging this pull request may close these issues.

Add option to choose the source of global attributes in mfdataset.