
open_mfdataset, function provided to combine_attrs has confusing behavior: multiple calls? separate last group of attributes? #6679

Open
gg2 opened this issue Jun 9, 2022 · 5 comments

gg2 commented Jun 9, 2022

What is your issue?

I'm attempting to use the combine_attrs parameter on open_mfdataset with a function to generate and preserve a list of values from specific attributes on specific variables.

I use a library (act-atmos) that uses xarray as a dependency.
I use act-atmos' act.io.armfiles.read_netcdf method to read in the data from a list of files.
( https://github.com/ARM-DOE/ACT/blob/main/act/io/armfiles.py )
I provide to read_netcdf the function to use for the combine_attrs parameter.

I am fairly confident that act-atmos does nothing unexpected with the list of files or parameters.
It sets combine='by_coords' and use_cftime=True, and leaves combine_attrs untouched.
These parameters are passed to open_mfdataset via **kwargs, along with the unmodified list of filenames.

I see unexpected behavior in the combine_attrs function.
To test, I read in 3 netCDF files, each with records starting at 6:00am on one day and ending at 6:00am the next day.
The resulting combined data spans 4 calendar days. I really only expect 3 days of data, but since each file overlaps the following day from midnight to 6:00am, ending up with 4 days makes sense. The last, 4th day will always have no data, though, because the data starts during daylight hours.

# Using the combined xarray data returned by open_mfdataset:
len(xarray_data['time'].values)  # 1440 * 3
xarray_data['time'].values
xarray_data['time'].coords

4320
['2009-01-01T06:00:00.000000000' '2009-01-01T06:01:00.000000000'
'2009-01-01T06:02:00.000000000' ... '2009-01-04T05:57:00.000000000'
'2009-01-04T05:58:00.000000000' '2009-01-04T05:59:00.000000000']
Coordinates: time (time) datetime64[ns] 2009-01-01T06:00:00 ... 2009-01-04T05:59:00

So I expected combine_attrs to receive a single list of 3 or 4 sets of attributes to iterate through all at once.
Instead, it apparently gets called twice: once with a list of 3 sets of attributes, and a 2nd time with a list containing 1 set of attributes.

The lone set in the 2nd call does contain the combined attributes from the 1st call.
But since I was expecting to iterate through a single list of all the attributes, I have to implement special logic to watch for that last set. If I don't, the last set in the 2nd call obliterates the results of the 1st call, and the result ends up looking similar to combine_attrs="override".
Without further experimentation, I don't know yet whether this behavior stays consistent for larger sets of files; I will experiment soon to find out.

Is this behavior of combine_attrs expected? Why would it be set up to make 2 (or multiple) separate calls like that?
Is it somehow a side-effect of the files having times that overlap days? Is it somehow because the last dataset is essentially empty? If I end up with one combined set of data, I would expect one combined list of attributes to iterate through.
If this behavior is unexpected, I will provide more specifics about the data as requested, to keep this initial post shorter.


Also, as a side note, it would be nice to improve the documentation for combine_attrs.
The details about the parameter don't even show up here:
https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html

And the description of the callable signature in the source code could be clearer:

  • What is a "sequence of" attrs dicts? Is "sequence of" the reason I'm seeing multiple calls to combine_attrs?
  • What might the context object possibly contain? When or why might it vary?

With experimentation I can find some of these things out (e.g. context is None, in my case); but it would be nice if these were clearer up front.

keewis commented Jun 9, 2022

Thanks for the report, @gg2

combine_attrs should be called once for the datasets (Dataset.attrs) and then once for each combined variable that has attributes. I'm not sure why you would get a list containing just a single dict, though; that list should have at least two entries.
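The call pattern described above can be observed with a small self-contained sketch (synthetic data, not the original files; the exact number of calls may vary by xarray version):

```python
# Minimal sketch (synthetic data): count how often a combine_attrs
# callable is invoked while combining two datasets by coordinates.
import xarray as xr

calls = []

def record_attrs(attrs_list, context):
    calls.append(len(attrs_list))
    # keep the first set of attributes, similar to combine_attrs="override"
    return dict(attrs_list[0])

ds1 = xr.Dataset(
    {"t2m": ("time", [1.0], {"units": "degC"})},
    coords={"time": [0]},
    attrs={"site": "A"},
)
ds2 = xr.Dataset(
    {"t2m": ("time", [2.0], {"units": "degC"})},
    coords={"time": [1]},
    attrs={"site": "B"},
)

combined = xr.combine_by_coords([ds1, ds2], combine_attrs=record_attrs)
print(calls)  # one entry per invocation; more than one call is expected
```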

The context object will contain more information about the context of the call, for example the method that was used to combine the datasets (xr.concat, xr.merge, xr.combine_by_coords, or xr.combine_nested) and its arguments (see #5668, but that PR has stalled unfortunately).

I did plan to create a dedicated user guide that explains all of this, but didn't manage to find the time to actually complete that.

The details about the parameter don't even show up here:

not sure why it's missing from stable, but it does show up in the latest documentation build, and should also be included in the next release.

keewis commented Jun 9, 2022

as far as I can tell, read_netcdf will rerun open_mfdataset with different options (most notably combine="nested" instead of combine="by_coords") in case there's an error, maybe that's the difference?

Could you maybe try a bare open_mfdataset with the same options to see what happens? Otherwise I'd indeed need some example data to be able to help debugging.

gg2 commented Jun 9, 2022

OK, it makes sense that there could be an error, especially because even the wrapping act-atmos does some fall-through to "nested" under certain parameter conditions.

I did see the dataset's "global" attributes coming through in the first pass. I didn't pay attention to whether they also came through in the 2nd pass.

Lastly, I did get my "simpler" combine_attrs callable working, even with the 2 passes. Between everything (mainly me being confused while trying to get things working, and making some bad interpretations), the 2nd pass didn't necessarily throw it off. I either had some incorrect logic, or I used an operator that didn't behave as expected (+= originally to combine previous + new values, vs. switching to f"{previous}, {new}"). I'm not sure.
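A plausible explanation for the += failure (hypothetical, since the original error wasn't recorded): numeric netCDF attributes are read as floats, and float += str raises a TypeError, while building a fresh f-string coerces both sides to text:

```python
# Hypothetical illustration: a numeric attr read from netCDF arrives
# as a float, so `previous += f",{new}"` mixes float and str and raises,
# while an f-string built from both values coerces them to text.
attrs = {"daily_mean": 3.14}  # float attr from the first file
new_val = 2.71                # float attr from the next file

try:
    attrs["daily_mean"] += f",{new_val}"  # float += str
except TypeError as exc:
    print(f"TypeError: {exc}")

attrs["daily_mean"] = f"{attrs['daily_mean']},{new_val}"  # works
print(attrs["daily_mean"])  # "3.14,2.71"
```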

But, I'll try a direct call to open_mfdataset and see what that does.

This, for reference, is the function I'm now passing in. It successfully does what I wanted:

def resample_attrs_final(attrs_list, context):
    #print(context) # None
    #print(f"attrs_list length: {len(attrs_list)}")

    stats_atts = ['daily_mean', 'daily_std', 'number_good_points']
    resampled_attrs = {}
    
    for attrs in attrs_list:
        for attr_name, attr_val in attrs.items():
            if attr_name not in resampled_attrs:
                resampled_attrs[attr_name] = attr_val
            elif attr_name in stats_atts:
                resampled_attrs[attr_name] = f"{resampled_attrs[attr_name]},{attr_val}"
                #resampled_attrs[attr_name] += f",{attr_val}" # Original attempt

    #print(f"Exiting : {','.join(list(resampled_attrs.keys()))}")
    return resampled_attrs

(I am happy that attribute values aren't forced to be the particular data type they might be declared as in netCDF. :D It is nice I can turn a float attribute into a string attribute.)
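For illustration, this is how the function behaves on a hand-built attrs_list (sample values, not from the original files; the function is repeated so the snippet runs on its own):

```python
def resample_attrs_final(attrs_list, context):
    # Same function as in the comment above, repeated for self-containment.
    stats_atts = ['daily_mean', 'daily_std', 'number_good_points']
    resampled_attrs = {}
    for attrs in attrs_list:
        for attr_name, attr_val in attrs.items():
            if attr_name not in resampled_attrs:
                resampled_attrs[attr_name] = attr_val
            elif attr_name in stats_atts:
                resampled_attrs[attr_name] = f"{resampled_attrs[attr_name]},{attr_val}"
    return resampled_attrs

attrs_list = [
    {"units": "degC", "daily_mean": 1.5},
    {"units": "degC", "daily_mean": 2.5},
    {"units": "degC", "daily_mean": 3.5},
]
result = resample_attrs_final(attrs_list, context=None)
print(result)  # {'units': 'degC', 'daily_mean': '1.5,2.5,3.5'}
```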

gg2 commented Jun 9, 2022

I ran the following:

try:
    xr_data = xr.open_mfdataset(files_list, combine='by_coords', concat_dim=None, use_cftime=True, combine_attrs=resample_attrs)
except Exception as e:
    print(f"Error reading in netCDF data, {type(e)}:\n{e}")

I do see 2 sets coming through still for len(files_list) == 3:

  • 1st with len 3 for the attributes list.
  • 2nd with len 1 for the attributes list.

I didn't see any exceptions.

@brendan-m-murphy

The same issue comes up when passing a function for combine_attrs in xr.concat.

My workaround was to add an attribute to the datasets as a flag, and have the function I passed act differently if that flag was present (dataset attrs) or not (data variable attrs).

Is this what the 'context' argument will do? I couldn't tell what it was supposed to do by looking at the languishing PR #5668.

Being able to pass separate arguments for combine_attrs for dataset attrs and data var attrs in concat might be useful.
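The flag-attribute workaround could look roughly like this (names and merging rules are illustrative, not from xarray's API; whether the variable-level branch is reached depends on the xarray version):

```python
# Sketch of the flag-attribute workaround described above.
import xarray as xr

FLAG = "__dataset_attrs__"  # sentinel attribute name, purely illustrative

def combine(attrs_list, context):
    if any(FLAG in attrs for attrs in attrs_list):
        # Dataset-level call: merge normally and drop the sentinel.
        merged = {}
        for attrs in attrs_list:
            merged.update(attrs)
        merged.pop(FLAG, None)
        return merged
    # Variable-level call: join conflicting values into a string.
    merged = {}
    for attrs in attrs_list:
        for key, val in attrs.items():
            merged[key] = f"{merged[key]},{val}" if key in merged else val
    return merged

ds1 = xr.Dataset({"x": ("t", [1.0], {"mean": 1.5})}, coords={"t": [0]},
                 attrs={FLAG: True, "site": "A"})
ds2 = xr.Dataset({"x": ("t", [2.0], {"mean": 2.5})}, coords={"t": [1]},
                 attrs={FLAG: True, "site": "B"})

out = xr.concat([ds1, ds2], dim="t", combine_attrs=combine)
print(out.attrs)  # sentinel dropped from the combined dataset attrs
```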
