Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine UnsignedIntegerCoder and CFMaskCoder #9274

Merged
merged 13 commits into from
Aug 20, 2024

Conversation

djhoese
Copy link
Contributor

@djhoese djhoese commented Jul 24, 2024

See #9266 and the related issues for the detailed discussion. Bottom line is that CF _Unsigned handling gets weird when handling _FillValue. The CF standard says that _FillValue on disk should always match the array's type on disk. However, when xarray loads the data and masks it, it does this with the in-memory unsigned integer data. Currently in main this is handled by a combination of the UnsignedIntegerCoder class and the CFMaskCoder class. It unfortunately requires the "decoded" unsigned _FillValue to be stored in the loaded variable so it can be used by the masking code. But to match CF standards this _FillValue should remain as the on-disk signed type. In this PR I combine the two classes to avoid storing the temporary unsigned version of _FillValue and only use it for masking.

Other important changes:

  1. A serialization warning is added to inform users that the _FillValue they have passed does not match expectations for the CF standard.
  2. _FillValue values that are numpy scalar types (ex. np.uint8(255)) are always converted to native python integers before being encoded for _Unsigned variables by calling .item() on them. This ensures the above serialization warning is always issued as numpy will silently cast a uint8 (ex. 255) to a int8 (ex. -1) without warning.

At the time of writing this code needs lots of cleanup or at least I hope it can because it is hard to follow in my opinion...but maybe that's just the way CF handling code is.

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@djhoese djhoese force-pushed the refactor-unsigned-masked-cf branch from 083c6b1 to cae77aa Compare July 24, 2024 19:34
transform = partial(np.asarray, dtype=signed_dtype)
data = lazy_elemwise_func(data, transform, signed_dtype)
if raw_fill_value is not None:
new_fill = signed_dtype.type(raw_fill_value)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block didn't use @kmuehlbauer's trick for handling overflow by using .view. I'm wondering if I need to use that here, but no tests hit it. I've never actually seen _Unsigned == "false" in the wild.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I change that here then I think I might be able to shrink this function and do things as "old dtype" and "new dtype" rather than signed versus unsigned.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block didn't use @kmuehlbauer's trick for handling overflow by using .view. I'm wondering if I need to use that here, but no tests hit it. I've never actually seen _Unsigned == "false" in the wild.

That was requested at some point in time for a specific use case. I'll try to dig it up.

Copy link
Contributor Author

@djhoese djhoese Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK looking at the tests, it looks like this case is not tested (but I'm triple checking). This set of ifs says that the data on-disk is unsigned, but _Unsigned is false which means the user wants signed data in-memory. The tests for _Unsigned="false" have signed data on-disk and in-memory so no casting/conversion happens.

Edit: Scratch that. test_backends doesn't test it, but test_coding does but never uses _FillValue. Let's see what I can do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref #4966

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I thought I was being smart and added tests for the _Unsigned: "false" case and now things are just all weird. It made me realize I wasn't handling this case in the encoding step, but it also kind of seems like it never was handled or that I have the wrong impression of how that configuration is supposed to be handled. The tests added in that PR @kmuehlbauer don't present the _FillValue so I've tried adding that to the backend tests but now non-NC4 backends are complaining about converting uint8 to int8. This coercion was added to solve #4014 it seems.

My assumption is that if _Unsigned: "false" then the data is saved as uint8 and _FillValue should be uint8. But again, I'm not sure why my new tests are even trying to get to int8. I'll do some more testing later tonight hopefully. I'm not sure if it is better to spend a ton of time getting this small functionality working "as expected" or leave it undefined/untested as it is right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my main concern and all of your handling in the decode pipeline for casting fill values turns out to not be an issue anymore as the raw fill values turn out to be numpy scalars (np.uint8) when they get loaded from the file. Or at least they are for the NetCDF4 cases. So numpy is perfectly happy casting uint8 to int8 and back if they are already numpy scalars. If my new tests cases make sense then I think this is fine.

@djhoese djhoese marked this pull request as ready for review August 2, 2024 14:50
@djhoese
Copy link
Contributor Author

djhoese commented Aug 2, 2024

I've run out of time to really work on refactoring this more. If people are able to review it in its current state that would be great. I'm not sure if it is in a good enough place to be merged, but I'll let the reviewers decide.

@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label Aug 5, 2024
@kmuehlbauer
Copy link
Contributor

Thanks @djhoese for tackling this hard part of the CF conventions. This is looking good to me. Let's have another CI run now.

@kmuehlbauer
Copy link
Contributor

Test failures are due to #9327.

@dcherian
Copy link
Contributor

Kai, please merge if you're comfortable with this!

@kmuehlbauer kmuehlbauer merged commit 4ab0679 into pydata:main Aug 20, 2024
28 checks passed
@kmuehlbauer
Copy link
Contributor

Thanks @djhoese, for going down that rabbit hole. Finger's crossed 😬.

dcherian added a commit to TomNicholas/xarray that referenced this pull request Aug 22, 2024
* main: (214 commits)
  Adds copy parameter to __array__ for numpy 2.0 (pydata#9393)
  `numpy 2` compatibility in the `pydap` backend (pydata#9391)
  pyarrow dependency added to doc environment (pydata#9394)
  Extend padding functionalities (pydata#9353)
  refactor GroupBy internals (pydata#9389)
  Combine `UnsignedIntegerCoder` and `CFMaskCoder` (pydata#9274)
  passing missing parameters to ZarrStore.open_store when opening a datatree (pydata#9377)
  Fix tests on big-endian systems (pydata#9380)
  Improve error message on `ds['x', 'y']` (pydata#9375)
  Improve error message for missing coordinate index (pydata#9370)
  Add flaky to TestNetCDF4ViaDaskData (pydata#9373)
  Make chunk manager an option in `set_options` (pydata#9362)
  Revise (pydata#9371)
  Remove duplicate word from docs (pydata#9367)
  Adding open_groups to BackendEntryPointEngine, NetCDF4BackendEntrypoint, and H5netcdfBackendEntrypoint (pydata#9243)
  Revise (pydata#9366)
  Fix rechunking to a frequency with empty bins. (pydata#9364)
  whats-new entry for dropping python 3.9 (pydata#9359)
  drop support for `python=3.9` (pydata#8937)
  Revise (pydata#9357)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Aug 30, 2024
* main:
  Adds copy parameter to __array__ for numpy 2.0 (pydata#9393)
  `numpy 2` compatibility in the `pydap` backend (pydata#9391)
  pyarrow dependency added to doc environment (pydata#9394)
  Extend padding functionalities (pydata#9353)
  refactor GroupBy internals (pydata#9389)
  Combine `UnsignedIntegerCoder` and `CFMaskCoder` (pydata#9274)
  passing missing parameters to ZarrStore.open_store when opening a datatree (pydata#9377)
  Fix tests on big-endian systems (pydata#9380)
  Improve error message on `ds['x', 'y']` (pydata#9375)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-benchmark Run the ASV benchmark workflow
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants