Combine `UnsignedIntegerCoder` and `CFMaskCoder` #9274

djhoese · 2024-07-24T19:34:19Z

See #9266 and the related issues for the detailed discussion. The bottom line is that CF _Unsigned handling gets awkward when combined with _FillValue. The CF standard says that _FillValue on disk should always match the array's on-disk type. However, when xarray loads the data and masks it, the masking is applied to the in-memory unsigned integer data. Currently in main this is handled by a combination of the UnsignedIntegerCoder class and the CFMaskCoder class. Unfortunately, that requires the "decoded" unsigned _FillValue to be stored on the loaded variable so the masking code can use it, whereas to match the CF standard this _FillValue should remain the on-disk signed type. In this PR I combine the two classes so that the temporary unsigned version of _FillValue is used only for masking and is never stored.
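To make the description concrete, here is a minimal NumPy-only sketch of the decode path for an _Unsigned = "true" variable. It is an illustration under stated assumptions, not xarray's actual implementation: the function name `decode_unsigned_masked` and the sample values are hypothetical.

```python
import numpy as np


def decode_unsigned_masked(on_disk, fill_value):
    """Hypothetical helper: decode int data flagged _Unsigned="true" and mask it."""
    on_disk = np.asarray(on_disk)
    # Unsigned dtype with the same byte width as the signed on-disk dtype.
    unsigned_dtype = np.dtype(f"u{on_disk.dtype.itemsize}")
    # Reinterpret the bytes; no values are converted, so nothing can overflow.
    in_memory = on_disk.view(unsigned_dtype)
    # Temporary unsigned fill value, used only for the comparison below.
    unsigned_fill = np.array(fill_value, dtype=on_disk.dtype).view(unsigned_dtype)
    masked = np.where(in_memory == unsigned_fill, np.nan, in_memory.astype(np.float32))
    # The signed fill value is returned unchanged so it still matches the on-disk type.
    return masked, fill_value


raw = np.array([-1, 0, 127, -128], dtype=np.int8)  # as stored on disk
decoded, kept_fill = decode_unsigned_masked(raw, np.int8(-1))
print(decoded)    # first element is NaN; the rest are 0, 127, 128
print(kept_fill)  # -1
```

The point of the sketch is that the unsigned fill value (255 here) exists only inside the function, while the value handed back to the caller keeps the on-disk signed type.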

Other important changes:

A serialization warning is added to inform users when the _FillValue they have passed does not match what the CF standard expects.
_FillValue values that are numpy scalar types (e.g. np.uint8(255)) are always converted to native Python integers, by calling .item() on them, before being encoded for _Unsigned variables. This ensures the above serialization warning is always issued, since numpy would otherwise silently cast a uint8 (e.g. 255) to an int8 (e.g. -1) without warning.
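The two changes above can be sketched roughly as follows. This is a hedged illustration, not the code in this PR: the helper name `encode_unsigned_fill_value` is hypothetical, `UserWarning` stands in for the serialization warning, and an explicit `np.iinfo` range check is used instead of relying on NumPy to raise, because NumPy's behavior for out-of-range Python integers differs between versions.

```python
import warnings

import numpy as np


def encode_unsigned_fill_value(fill_value, on_disk_dtype):
    """Hypothetical helper: return fill_value as a Python int valid for on_disk_dtype."""
    on_disk_dtype = np.dtype(on_disk_dtype)
    # NumPy scalars become plain Python ints so the range check sees the real value.
    if hasattr(fill_value, "item"):
        fill_value = fill_value.item()
    limits = np.iinfo(on_disk_dtype)
    if limits.min <= fill_value <= limits.max:
        return fill_value
    warnings.warn(
        f"_FillValue {fill_value} does not fit the on-disk dtype {on_disk_dtype}; "
        "reinterpreting it to match the CF standard.",
        UserWarning,  # stand-in; the PR describes a serialization warning
        stacklevel=2,
    )
    # Assume the value was given in the in-memory dtype (same width, other signedness).
    other_kind = "u" if on_disk_dtype.kind == "i" else "i"
    in_memory_dtype = np.dtype(f"{other_kind}{on_disk_dtype.itemsize}")
    return np.array(fill_value, dtype=in_memory_dtype).view(on_disk_dtype).item()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(encode_unsigned_fill_value(np.uint8(255), np.int8))  # -1, with a warning
    print(encode_unsigned_fill_value(np.int8(-1), np.int8))    # -1, no warning
print(len(caught))  # 1
```

Converting with .item() first is what makes the mismatch detectable: once the value is a plain Python integer, 255 is visibly outside the int8 range instead of being wrapped to -1 unnoticed.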

At the time of writing, this code needs a lot of cleanup, or at least I hope it can be cleaned up, because in my opinion it is hard to follow... but maybe that's just the way CF handling code is.

Closes #xxxx
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

djhoese · 2024-07-26T16:23:30Z

xarray/coding/variables.py

+            transform = partial(np.asarray, dtype=signed_dtype)
+            data = lazy_elemwise_func(data, transform, signed_dtype)
+            if raw_fill_value is not None:
+                new_fill = signed_dtype.type(raw_fill_value)


This block didn't use @kmuehlbauer's trick for handling overflow by using .view. I'm wondering if I need to use that here, but no tests hit it. I've never actually seen _Unsigned == "false" in the wild.

If I change that here then I think I might be able to shrink this function and do things as "old dtype" and "new dtype" rather than signed versus unsigned.
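For reference, a minimal sketch of what an "old dtype" to "new dtype" reinterpretation with .view could look like. The helper name `reinterpret_integers` is hypothetical and this is not the code in the PR; it only illustrates that .view reinterprets the stored bytes, so a value such as 255 wraps to -1 without any overflow check being involved.

```python
import numpy as np


def reinterpret_integers(values, new_dtype):
    """Hypothetical helper: view same-width integer data under another dtype."""
    old = np.asarray(values)
    new_dtype = np.dtype(new_dtype)
    if old.dtype.itemsize != new_dtype.itemsize:
        raise ValueError("old and new dtypes must have the same byte width")
    # .view reinterprets the existing bytes, so out-of-range values wrap
    # instead of triggering an overflow check.
    return old.view(new_dtype)


on_disk = np.array([0, 127, 128, 255], dtype=np.uint8)
print(reinterpret_integers(on_disk, np.int8).tolist())       # [0, 127, -128, -1]
print(reinterpret_integers(np.uint8(255), np.int8).item())   # -1
```

Because the same call works in both directions (uint8 to int8 and back), a single code path phrased in terms of old and new dtypes could in principle serve both _Unsigned settings.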

> This block didn't use @kmuehlbauer's trick for handling overflow by using .view. I'm wondering if I need to use that here, but no tests hit it. I've never actually seen _Unsigned == "false" in the wild.

That was requested at some point in time for a specific use case. I'll try to dig it up.

OK, looking at the tests, it looks like this case is not tested (but I'm triple-checking). This set of ifs handles the case where the data on disk is unsigned but _Unsigned is "false", which means the user wants signed data in memory. The tests for _Unsigned="false" have signed data both on disk and in memory, so no casting/conversion happens.

Edit: Scratch that. test_backends doesn't test it; test_coding does, but never uses _FillValue. Let's see what I can do.

Well, I thought I was being smart by adding tests for the _Unsigned: "false" case, and now things are just all weird. It made me realize I wasn't handling this case in the encoding step, but it also seems like it either never was handled or I have the wrong impression of how that configuration is supposed to be handled. The tests added in that PR, @kmuehlbauer, don't include a _FillValue, so I've tried adding one to the backend tests, but now the non-NC4 backends are complaining about converting uint8 to int8. It seems this coercion was added to solve #4014.

My assumption is that if _Unsigned: "false" then the data is saved as uint8 and _FillValue should be uint8. But again, I'm not sure why my new tests are even trying to get to int8. Hopefully I'll do some more testing later tonight. I'm not sure whether it is better to spend a ton of time getting this small piece of functionality working "as expected" or to leave it undefined/untested as it is right now.

So my main concern, and all of your handling in the decode pipeline for casting fill values, turns out not to be an issue anymore, because the raw fill values are numpy scalars (np.uint8) when they are loaded from the file. Or at least they are for the NetCDF4 cases. numpy is perfectly happy casting uint8 to int8 and back if the values are already numpy scalars. If my new test cases make sense then I think this is fine.

djhoese · 2024-08-02T16:01:42Z

I've run out of time to really work on refactoring this more. If people are able to review it in its current state that would be great. I'm not sure if it is in a good enough place to be merged, but I'll let the reviewers decide.

kmuehlbauer · 2024-08-09T07:48:55Z

Thanks @djhoese for tackling this hard part of the CF conventions. This is looking good to me. Let's have another CI run now.

kmuehlbauer · 2024-08-09T09:31:06Z

Test failures are due to #9327.

dcherian · 2024-08-20T16:13:50Z

Kai, please merge if you're comfortable with this!

kmuehlbauer · 2024-08-20T16:55:50Z

Thanks @djhoese for going down that rabbit hole. Fingers crossed 😬.

djhoese added 2 commits July 24, 2024 14:34

Fix small typo in docstring

e1baa93

Combine CF Unsigned and Mask handling

cae77aa

djhoese force-pushed the refactor-unsigned-masked-cf branch from 083c6b1 to cae77aa July 24, 2024 19:34

djhoese added 2 commits July 25, 2024 21:16

Replace UnsignedIntegerCode tests with CFMaskCoder usage

a37c5d0

Fix dtype type annotation

39b65c5

dcherian reviewed Jul 26, 2024

xarray/coding/variables.py

djhoese added 2 commits July 26, 2024 10:45

Fix when unsigned serialization warning is expected in tests

b72ded9

Small refactor of CFMaskCoder decoding

e6e71e2

djhoese commented Jul 26, 2024

Add CF encoder tests for _Unsigned=false cases

c996918

djhoese marked this pull request as ready for review August 2, 2024 14:50

djhoese added 2 commits August 2, 2024 09:51

Merge branch 'main' into refactor-unsigned-masked-cf

a8eb418

Remove UnsignedIntegerCoder from api docs

bdc122e

Illviljan added the run-benchmark label Aug 5, 2024

Merge branch 'main' into refactor-unsigned-masked-cf

dea730f

kmuehlbauer approved these changes Aug 9, 2024

dcherian and others added 3 commits August 13, 2024 20:37

Merge branch 'main' into refactor-unsigned-masked-cf

aecc9fa

Merge branch 'main' into refactor-unsigned-masked-cf

01a4742

Merge branch 'main' into refactor-unsigned-masked-cf

5aae952

kmuehlbauer merged commit 4ab0679 into pydata:main Aug 20, 2024
28 checks passed

dcherian mentioned this pull request Aug 26, 2024

Combine UnsignedIntegerCoder and CFMaskCoder #9266

Closed
