Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fill missing data_vars during concat by reindexing #7400

Merged
merged 39 commits into from
Jan 20, 2023

Conversation

kmuehlbauer
Copy link
Contributor

@kmuehlbauer kmuehlbauer commented Dec 22, 2022

This is another attempt to solve #508. Took inspiration from #3545 by @scottcha.

This follows along @dcherian's comment in the same above PR (#3545 (comment)).

Update:

After review the variable order is estimated by order of appearance in the list of datasets. That keeps full backwards compatibility and is deterministic. Thanks @dcherian and @keewis for the suggestions.

@kmuehlbauer
Copy link
Contributor Author

There are no tests added currently. I'm just wondering, if that approach would work in general.

The current assumptions here:

  • works only on data variables, not coordinates
  • order is taken from first dataset, variables missing in first dataset are appended at the end
  • missing variables are filled with fill_value by reindexing (thanks @dcherian for the inspiration).

@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label Dec 27, 2022
@kmuehlbauer
Copy link
Contributor Author

Thanks @Illviljan for activating the benchmark runs. Are those errors related to the changes? I'm not up to date with mypy, are these errors induced by changes here?

@Illviljan
Copy link
Contributor

Benchmark is a numba issue, probably #7306.

Mypy is real, cannot getitem a object. Try out using isinstance instead of the try/except to narrow the typing.

@github-actions github-actions bot removed the run-benchmark Run the ASV benchmark workflow label Jan 4, 2023
@kmuehlbauer
Copy link
Contributor Author

OK, green light's now also on mypy. Looks like the approach would work in general. Trying to add some tests now.

@kmuehlbauer
Copy link
Contributor Author

This is ready for a first round of review. Thanks!

@scottcha
Copy link
Contributor

scottcha commented Jan 8, 2023

Hi, thanks for doing this @kmuehlbauer . FWIW I'm no longer seeing the issue I was previously seeing when I submitted #3545 when I just run with released xarray v2022.12.0 (I haven't gone back further to see when the issue started going away so I'm not really sure if the old error has just been suppressed or if the single case I was seeing back then was resolved in a previous PR--or there is also a chance there is something which changed in my data over that long time period).

That being said I also applied this PR to my workflow and reran the concat code and it continues to pass correctly with this PR from what I've seen so far.

@kmuehlbauer did you see the tests I created here? https://github.com/pydata/xarray/blob/03f9b3b85aee039f47dd693322492ab09f57fb73/xarray/tests/test_concat.py
Not all of them got to a passing state but there were several cases I tried to document with the tests there.

@kmuehlbauer
Copy link
Contributor Author

Thanks @scottcha for taking the time to testing things.

I'll have a look at your tests in more detail now. I've concentrated on my use case in the first place and hoped to get away with it 😀.

Copy link
Contributor

@Illviljan Illviljan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some typing suggestions.

xarray/tests/test_concat.py Outdated Show resolved Hide resolved
xarray/tests/test_concat.py Outdated Show resolved Hide resolved
xarray/tests/test_concat.py Outdated Show resolved Hide resolved
xarray/tests/test_concat.py Outdated Show resolved Hide resolved
@kmuehlbauer
Copy link
Contributor Author

Some typing suggestions.

Thanks @Illviljan, your suggestions and help is much appreciated.

@kmuehlbauer
Copy link
Contributor Author

@scottcha I've found a glitch in the code due to your tests. Already pushed the changes here.

I'm going to cherry pick your tests here next.

@kmuehlbauer
Copy link
Contributor Author

@scottcha I've tried to cherry-pick, but ended up copy/pasting and adding your authorship to the commit.

I think the final problem is the order in:

  • test_concat_missing_multiple_consecutive_var
  • test_multiple_datasets_with_multiple_missing_variables

These tests are flaky. Sometimes the order is correct and sometimes not. Can't immediately see the root cause here.

@Illviljan I'll try to add typing to these additional tests. Should be good for learning that.

@kmuehlbauer
Copy link
Contributor Author

@Illviljan OK, I'm stuck now. I can't make anything out of the remaining mypy errors. Would be great if you could have another look here, thanks!

xarray/tests/test_concat.py Outdated Show resolved Hide resolved
xarray/tests/test_concat.py Outdated Show resolved Hide resolved
@kmuehlbauer
Copy link
Contributor Author

@scottcha I think I've managed to get along with your tests. It looks like everything is running now.

One thing which is still unresolved:

  • The order of data variables which are not available in the first dataset is not deterministic because of using set for gathering all variables. But maybe that can be neglected for now.

@Illviljan @dcherian This is ready for another round of review. Thanks for considering.

xarray/core/concat.py Show resolved Hide resolved
xarray/tests/test_concat.py Outdated Show resolved Hide resolved
xarray/tests/test_concat.py Outdated Show resolved Hide resolved
@kmuehlbauer
Copy link
Contributor Author

I was hoping to gain something by merging the variable order code with _parse_datasets, to only have to traverse the datasets once.

The current behaviour, and the best I've come up so far in terms of performance:

  1. count number of variables while iterating datasets (_parse_datasets)
  2. check if first dataset contains all wanted variables
    2a. if that's the case, take the order from first dataset
  3. check if the dataset with max count variables contains all wanted variables
    3a. if that's the case, take the order from that dataset
  4. if not 2a or 3a, take order from first dataset and append missing variables to the end

xarray/core/concat.py Outdated Show resolved Hide resolved
@kmuehlbauer
Copy link
Contributor Author

Finally, this is as far I could get with it. I'll leave it as is now. Looking forward for reviews and suggestions. Thanks @Illviljan for the great support!

@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label Jan 10, 2023
@kmuehlbauer kmuehlbauer reopened this Jan 10, 2023
@github-actions github-actions bot removed the run-benchmark Run the ASV benchmark workflow label Jan 10, 2023
@kmuehlbauer
Copy link
Contributor Author

@dcherian rebased on latest main and fixed whats-new.rst. Should be good for another review.

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kmuehlbauer and @scottcha . This is great to see finished! Very clean implementation in the end.

@dcherian dcherian added the plan to merge Final call for comments label Jan 19, 2023
@kmuehlbauer
Copy link
Contributor Author

@dcherian There slipped an old item from whats-new.rst back into. I've removed it. Should be OK now.

Great to see this functionality coming to next xarray version.

@dcherian dcherian changed the title ENH: fill missing variables during concat by reindexing Fill missing data_vars during concat by reindexing Jan 20, 2023
@dcherian dcherian merged commit b4e3cbc into pydata:main Jan 20, 2023
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 25, 2023
* upstream/main:
  RTD maintenance (pydata#7477)
  fix the RTD build skipping feature (pydata#7476)
  Add benchmarks for to_dataframe and to_dask_dataframe (pydata#7474)
  allow skipping RTD builds (pydata#7470)
  create separate environment files for `python=3.11` (pydata#7469)
  Bump mamba-org/provision-with-micromamba from 14 to 15 (pydata#7466)
  install `numbagg` from `conda-forge` (pydata#7415)
  Fill missing data_vars during concat by reindexing (pydata#7400)
  [skip-cii] Add pyodide update instructions to HOW_TO_RELEASE (pydata#7449)
  [skip-ci] whats-new for next release (pydata#7455)
  v2023.01.0 whats-new (pydata#7440)
@kmuehlbauer kmuehlbauer deleted the fix-issue-508 branch May 25, 2023 07:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments run-benchmark Run the ASV benchmark workflow
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ignore missing variables when concatenating datasets?
5 participants