Ignore missing variables when concatenating datasets? #508

Closed
shoyer opened this issue Aug 2, 2015 · 8 comments · Fixed by #3364 or #7400
Labels
topic-combine combine/concat/merge

Comments

@shoyer
Member

shoyer commented Aug 2, 2015

Several users (@raj-kesavan, @richardotis, now myself) have wondered about how to concatenate xray Datasets with different variables.

With the current xray.concat, you need to awkwardly create dummy variables filled with NaN in datasets that don't have them (or drop mismatched variables entirely). Neither of these is a great option -- concat should have an option (the default?) to take care of this for the user.
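
To make the two workarounds concrete, here is a minimal sketch. It assumes the present-day `xarray` package name and API (`assign`, `drop_vars`, `full_like`) rather than the `xray` of 2015, and the dataset and variable names (`ds_a`, `ds_b`, `temp`, `rain`) are made up for illustration.

```python
import numpy as np
import xarray as xr

# Two hypothetical datasets: only the first one has "rain".
ds_a = xr.Dataset(
    {"temp": ("time", [1.0, 2.0]), "rain": ("time", [0.1, 0.2])},
    coords={"time": [0, 1]},
)
ds_b = xr.Dataset({"temp": ("time", [3.0, 4.0])}, coords={"time": [2, 3]})

# Workaround 1: add a NaN-filled dummy "rain" to the dataset that lacks it.
ds_b_padded = ds_b.assign(rain=xr.full_like(ds_b["temp"], np.nan))
padded = xr.concat([ds_a, ds_b_padded], dim="time")

# Workaround 2: drop the mismatched variable from every dataset that has it.
dropped = xr.concat([ds_a.drop_vars("rain"), ds_b], dim="time")

print(padded["rain"].values)
print(list(dropped.data_vars))
```

Both variants push bookkeeping onto the caller, which is the awkwardness described above.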

This would also be more consistent with pd.concat, which takes a more relaxed approach to matching dataframes with different variables (it does an outer join).
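
For comparison, a small sketch of the pandas behaviour referred to here. The frames and column names are invented for illustration; the only thing relied on is that `pd.concat` defaults to `join="outer"`.

```python
import pandas as pd

# Hypothetical frames: only the first one has a "rain" column.
df_a = pd.DataFrame({"temp": [1.0, 2.0], "rain": [0.1, 0.2]})
df_b = pd.DataFrame({"temp": [3.0, 4.0]})

# The default outer join keeps the union of columns and
# fills the rows that lack "rain" with NaN.
out = pd.concat([df_a, df_b], ignore_index=True)
print(out)
```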

@max-sixty
Collaborator

Closing as stale; please reopen if still relevant.

@scottcha
Contributor

scottcha commented Nov 14, 2019

I just ran into this issue. While the previous fix seems to handle one case, it doesn't handle all the cases. Before I clean this up and open a new PR, does this look like it's on the right track? (It worked for my issue, where I was concatenating multiple datasets which always had the same dims and coordinates but sometimes were missing variables.)

Starts at line 353 of concat.py:

for k in datasets[0].variables:
    if k in concat_over:
        try:
            # new code
            for ds in datasets:
                if k not in ds.variables:
                    # make a new array with the same dimensions and coordinates;
                    # by default this will be initialized to np.nan, which is what we want
                    from .dataarray import DataArray
                    new_array = DataArray(coords=ds.coords, dims=ds.dims)
                    ds[k] = new_array
            # end new code
            vars = ensure_common_dims([ds.variables[k] for ds in datasets])
        except KeyError:
            # this can likely be removed then
            raise ValueError("%r is not present in all datasets." % k)
        combined = concat_vars(vars, dim, positions)
        assert isinstance(combined, Variable)
        result_vars[k] = combined

@dcherian
Contributor

Thanks for tackling this very important issue @scottcha !

from .dataarray import DataArray
new_array = DataArray(coords=ds.coords, dims=ds.dims)
ds[k] = new_array

Instead of creating a DataArray we only need to create a Variable (https://xarray.pydata.org/en/stable/internals.html#variable-objects).

I would instead try full_like(example_variable, fill_value=np.nan) (import full_like from the appropriate file). The trick would be figuring out what example_variable is. Maybe like this? (there may be some clever way to avoid the two loops)

variables = []
for ds in datasets:
    if k in ds.variables:
        filled = full_like(ds.variables[k], fill_value=np.nan)
        break

for ds in datasets:
    if k not in ds.variables:
        variables.append(filled)
    else:
        variables.append(ds.variables[k])

vars = ensure_common_dims(variables)
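
One possible way to fold the two loops together, written as a standalone sketch against the public `xarray` API rather than the internals of concat.py. The helper name `collect_with_fill` and the example datasets are hypothetical, and it assumes (as in the case described above) that every dataset has the same dims and sizes, so a single filled template fits all of them.

```python
import numpy as np
import xarray as xr


def collect_with_fill(datasets, name):
    """Return one Variable per dataset, NaN-filled where `name` is absent."""
    # The first dataset that has the variable serves as the template.
    # The template here is float; NaN cannot be stored in an integer array.
    template = next(ds.variables[name] for ds in datasets if name in ds.variables)
    filled = xr.full_like(template, fill_value=np.nan)
    return [ds.variables[name] if name in ds.variables else filled for ds in datasets]


ds_a = xr.Dataset({"temp": ("x", [1.0, 2.0]), "rain": ("x", [0.1, 0.2])})
ds_b = xr.Dataset({"temp": ("x", [3.0, 4.0])})  # "rain" is missing here

variables = collect_with_fill([ds_a, ds_b], "rain")
combined = xr.Variable.concat(variables, dim="x")
print(combined.values)
```

Unlike the earlier snippet, this does not assign into the input datasets, so the caller's objects are left untouched.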

Please send in a PR with any progress you make. We are happy to help out. We have some documentation on contributing and testing here: https://xarray.pydata.org/en/stable/contributing.html

@scottcha
Contributor

Ok got it, I'll take a look and spin up a PR.
Thanks

@Filip-K

Filip-K commented Jun 2, 2022

Hi guys! Just to clarify, this is not fixed by #3769 (which only concerns coordinates, not variables) nor by #3364 (which concerns merge not concat). It would be fixed by #3545, but this one is not merged yet. Right?

@dcherian
Contributor

dcherian commented Jun 2, 2022

Yes that is correct

@zoj613

zoj613 commented Nov 9, 2022

Any plans to support this?

@kmuehlbauer
Contributor

There is another attempt to get this resolved in #7400. Any input appreciated over there.

7 participants