Granular BitGroom feature for netcdf-c #2130

czender · 2021-10-20T23:10:30Z

Granular BitGroom (GBG) combines features of BitGroom, BitRound by Kouznetsov (2020), and DigitRound by Delaunay et al. (2019). GBG improves compression ratios by ~20% relative to BitGroom for NSD=3 on our benchmark 1 GB climate model output dataset. Its invocation is identical to BitGroom, so this patchset mainly utilizes a new enumerated value of the quantize_mode flag to invoke the new algorithm. No tests (yet) in this patchset. For correctness, GBG can be compared to current implementations in NCO and CCR.

edwardhartnett · 2021-10-21T11:07:52Z

@czender do you have cites for the two papers you reference?

czender · 2021-10-21T15:59:45Z

@edwardhartnett If you mean the following two, then yes, and they are also on the CCR homepage:

Delaunay, X., A. Courtois, and F. Gouillon (2019), Evaluation of lossless and lossy algorithms for the compression of scientific datasets in netCDF-4 or HDF5 files, Geosci. Model Dev., 12(9), 4099-4113, doi:10.5194/gmd-2018-250

Kouznetsov, R. (2021), A note on precision-preserving compression of scientific data, Geosci. Model Dev., 14(1), 377-389, https://doi.org/10.5194/gmd-14-377-2021

DennisHeimbigner · 2021-10-21T19:35:00Z

Is the number of algorithms likely to grow significantly? If so, then I must object to having a different attribute name for every algorithm.

edwardhartnett · 2021-10-21T20:01:28Z

I believe not more than a few algorithms are envisioned, right @czender ?

But I wonder why we don't just take the best algorithm and use that? Do we need the old one, once an improvement has come along?

czender · 2021-10-21T20:46:34Z

@DennisHeimbigner I do not intend to try to add any more quantization algorithms. After BitGroom was published in 2016, it inspired others to optimize quantization algorithms even further, within the constraints of guaranteeing the user-specified NSD and keeping the IEEE on-disk format so that no decoder is necessary. This GBG algorithm incorporates that progress, which significantly improves compression ratio (CR), something like 20% better than BG for NSD=3 (followed by DEFLATE). The "low-hanging fruit" are in GBG, so I think algorithmic improvements cannot improve GBG CR by more than 5% without violating the above constraints. I thought it important that netcdf-c not be limited to BG given the known improvements that were possible, so once Ed put in BG, it prompted me to develop and submit a "best of" algorithm to netcdf-c rather than see it languish in NCO or CCR.
@edwardhartnett Offering multiple quantization options to users in netcdf-c may not be helpful to anyone. Most users will not understand the trade-offs of different algorithms, and those who really care or wish to intercompare algorithms could use NCO or CCR. Nevertheless, "we" decided that a quantize_mode enum was wise, so the natural path for GBG was to use that. I understand Dennis' reservation about attribute name proliferation. The options seem to be to 1) Use the existing attribute name convention but limit the number of algorithms. 2) Change the convention so that all NSD-based quantization algorithms use the same attribute name. 3) A combination of 1 and 2 where netcdf-c has only one quantization algorithm (and thus one attribute name), and the algorithm is allowed to change incrementally under the hood when justified. Any of those seems fine to me. FWIW, NCO quantization defaults to GBG now.
GBG yields better CR than BG for small NSD, so its accuracy is correspondingly worse than BG. However, users essentially pick the desired quantization error when they choose NSD for a variable. Is there any point to giving users two choices (algorithm and NSD) instead of one (NSD)? Maybe, maybe not. Anyway, this is clearly an important topic to seek consensus on before releasing 4.8.2. Thanks for raising these questions.
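
To illustrate the constraint mentioned above (quantized values remain ordinary IEEE floats, so readers need no decoder), here is a minimal C sketch of plain round-to-nearest on the mantissa of a 32-bit float. It is not the GBG algorithm in this patchset: `round_mantissa` and `keep_bits` are hypothetical names, and mapping NSD to a number of kept bits is left to the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of netcdf-c: round a 32-bit IEEE float to
 * keep_bits explicit mantissa bits (1..22) and zero the remaining bits.
 * The result is still a valid IEEE float, so no decoder is needed. */
float round_mantissa(float value, int keep_bits)
{
    assert(keep_bits >= 1 && keep_bits <= 22);

    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);      /* reinterpret without aliasing issues */

    /* Leave NaN and infinity (exponent field all ones) untouched. */
    if ((bits & 0x7F800000u) == 0x7F800000u)
        return value;

    int drop = 23 - keep_bits;               /* number of low mantissa bits to clear */
    uint32_t half = 1u << (drop - 1);        /* half of the last kept bit */
    uint32_t mask = ~((1u << drop) - 1u);    /* ones over sign, exponent, kept bits */

    /* Add half, then truncate. A carry out of the mantissa increments the
     * exponent, which yields the correctly rounded magnitude. */
    bits = (bits + half) & mask;

    memcpy(&value, &bits, sizeof value);
    return value;
}
```

For example, `round_mantissa(3.14159265f, 7)` returns 3.140625f. The low 16 mantissa bits of the result are zero, and it is those runs of zeros that a lossless compressor such as DEFLATE can then shrink.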

WardF · 2021-11-04T21:56:47Z

While clearing out the PR backlog, I see that the quantize test is giving an 'unexpected error'. The autotools-based test is silent; I will adjust that to provide additional information in the case of failure. The cmake test is at least a little bit more verbose about it.

czender · 2021-11-04T23:08:29Z

Thanks, @WardF. I see why the test fails and it's an easy fix. The test expects an error when it requests the quantize mode one greater than NC_QUANTIZE_BITGROOM, i.e., it assumes NC_QUANTIZE_BITGROOM is the last quantize mode defined. Now NC_QUANTIZE_GRANULARBG is the last quantize mode defined, so the test that previously passed should now fail, as observed. I am not sure why this was silently passing in the autoconf-based testing I did earlier, though. I think the simplest fix is to change the test so it tries to access an undefined quantize mode one greater than NC_QUANTIZE_GRANULARBG, and I will submit a patch for that soon. Another route would be to add another token, e.g., NC_QUANTIZE_MODE_MAX, defined as the greatest valid enumerated value of the quantize modes. I hesitate to do that because my sense is that adding tokens is frowned upon.
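
The two routes described above can be sketched as follows. The enum and error codes here are stand-ins with assumed values, not copied from netcdf.h, and `check_quantize_mode` is a hypothetical validator rather than the library's actual code path. `QMODE_MAX` plays the role of the proposed NC_QUANTIZE_MODE_MAX token, and `self_check` mirrors the test fix of probing one past the last valid mode.

```c
#include <assert.h>

/* Stand-in constants for illustration only; names and values are assumptions. */
enum {
    QMODE_NONE       = 0,
    QMODE_BITGROOM   = 1,
    QMODE_GRANULARBG = 2,
    QMODE_MAX        = QMODE_GRANULARBG   /* hypothetical "max valid mode" token */
};

#define ERR_NONE  0
#define ERR_INVAL (-1)                    /* placeholder error code */

/* Hypothetical validator: accept only modes in [QMODE_NONE, QMODE_MAX]. */
int check_quantize_mode(int mode)
{
    if (mode < QMODE_NONE || mode > QMODE_MAX)
        return ERR_INVAL;
    return ERR_NONE;
}

/* Test in the spirit of the fix: the last defined mode is accepted and the
 * value one greater is rejected. Because the probe is written relative to
 * QMODE_MAX, adding a new mode only requires updating QMODE_MAX. */
void self_check(void)
{
    assert(check_quantize_mode(QMODE_MAX) == ERR_NONE);
    assert(check_quantize_mode(QMODE_MAX + 1) == ERR_INVAL);
}
```

Without such a token, the test has to name the last mode explicitly, which is why it broke when a new mode was appended.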

re: PR Unidata#2088
re: PR Unidata#2130
replaces: Unidata#2140

Changes:
* Add NCZarr-specific quantize functions to the dispatch table.
* Copy (modified) quantize code from libhdf5 to NCZarr
* Add quantize invocation to zvar.c
* Add support for _QuantizeBitgroomNumberOfSignificantDigits and _QuantizeGranularBitgroomNumberOfSignificantDigits to ncgen.
* Modify nc_test4/tst_quantize.c to allow it to be used both for hdf5 and for nczarr.
* Make dap4 properly handle quantize functions in dispatch table.
* Add quantize attribute support to ncgen.

Other changes:
* Caught and fixed some S3 problems
* Fixed some nczarr fillvalue problems.
* Fixed some nczarr cache problems.
* Cleanup some flaws in libdispatch/dinfermodel.c
* Allow byterange requests to S3 be readable by dinfermodel.c/check_file_type
* Remove the libnczarr ztracedispatch code (big change).

First draft of Granular BitGroom feature for netcdf-c

fb70b4c

czender requested a review from WardF as a code owner October 20, 2021 23:10

czender marked this pull request as draft October 20, 2021 23:10

czender added 3 commits October 21, 2021 10:33

Eliminate GBG-specific initialization, pad syntax with whitespace, try to fix syntax bugs

279c34b

add missing variables

e609762

Change NC_QUANTIZE_ATT_NAME to NC_QUANTIZE_BITGROOM_ATT_NAME

e7394af

edwardhartnett approved these changes Oct 21, 2021

czender marked this pull request as ready for review October 21, 2021 19:03

czender changed the title ~~First draft of Granular BitGroom feature for netcdf-c~~ Granular BitGroom feature for netcdf-c Oct 21, 2021

Change test to verify that using quantize mode one greater than NC_GRANULARBG (instead of NC_QUANTIZE_BITGROOM) fails.

48560bf

jswhit mentioned this pull request Nov 12, 2021

add support for quantization/bit-grooming in netcdf-c 4.8.2 Unidata/netcdf4-python#1140

Merged

WardF merged commit 3980d76 into Unidata:main Jan 14, 2022

DennisHeimbigner mentioned this pull request Jan 24, 2022

Add complete bitgroom support to NCZarr #2197

Merged
