Granular BitGroom feature for netcdf-c #2130

Merged
merged 5 commits into from
Jan 14, 2022

czender
Contributor

@czender czender commented Oct 20, 2021

Granular BitGroom (GBG) combines features of BitGroom, BitRound by Kouznetsov (2021), and DigitRound by Delaunay et al. (2019). GBG improves compression ratios by ~20% relative to BitGroom for NSD=3 on our benchmark 1 GB climate model output dataset. Its invocation is identical to BitGroom, so this patchset mainly uses a new enumerated value of the quantize_mode flag to invoke the new algorithm. There are no tests (yet) in this patchset. For correctness, GBG can be compared to the current implementations in NCO and CCR.

@czender czender requested a review from WardF as a code owner October 20, 2021 23:10
@czender czender marked this pull request as draft October 20, 2021 23:10
@edwardhartnett
Contributor

@czender do you have cites for the two papers you reference?

@czender
Contributor Author

czender commented Oct 21, 2021

@edwardhartnett If you mean the following two, then yes, and they are also on the CCR homepage:

Delaunay, X., A. Courtois, and F. Gouillon (2019), Evaluation of lossless and lossy algorithms for the compression of scientific datasets in netCDF-4 or HDF5 files, Geosci. Model Dev., 12(9), 4099-4113, doi:10.5194/gmd-2018-250

Kouznetsov, R. (2021), A note on precision-preserving compression of scientific data, Geosci. Model Dev., 14(1), 377-389, https://doi.org/10.5194/gmd-14-377-2021

@czender czender marked this pull request as ready for review October 21, 2021 19:03
@czender czender changed the title First draft of Granular BitGroom feature for netcdf-c Granular BitGroom feature for netcdf-c Oct 21, 2021
@DennisHeimbigner
Collaborator

Is the number of algorithms likely to grow significantly?
If so, then I must object to having a different attribute name for every algorithm.

@edwardhartnett
Contributor

I believe not more than a few algorithms are envisioned, right @czender ?

But I wonder why we don't just take the best algorithm and use that? Do we need the old one, once an improvement has come along?

@czender
Contributor Author

czender commented Oct 21, 2021

@DennisHeimbigner I do not intend to try to add any more quantization algorithms. After BitGroom was published in 2016, it inspired others to optimize quantization algorithms even further, within the constraints of guaranteeing the user-specified NSD and keeping the IEEE on-disk format so that no decoder is necessary. This GBG algorithm incorporates that progress, which significantly improves the compression ratio (CR): something like 20% better than BG for NSD=3 (followed by DEFLATE). The "low-hanging fruit" is in GBG, so I think further algorithmic improvements cannot improve GBG's CR by more than 5% without violating the above constraints. I thought it important that netcdf-c not be limited to BG given the known improvements that were possible, so once Ed put in BG, it prompted me to develop and submit a "best of" algorithm to netcdf-c rather than see it languish in NCO or CCR.
@edwardhartnett Offering multiple quantization options to users in netcdf-c may not be helpful to anyone. Most users will not understand the trade-offs of different algorithms, and those who really care or wish to intercompare algorithms could use NCO or CCR. Nevertheless, "we" decided that a quantize_mode enum was wise, so the natural path for GBG was to use that. I understand Dennis' reservation about attribute name proliferation. The options seem to be: 1) Use the existing attribute name convention but limit the number of algorithms. 2) Change the convention so that all NSD-based quantization algorithms use the same attribute name. 3) A combination of 1 and 2, where netcdf-c has only one quantization algorithm (and thus one attribute name) and the algorithm is allowed to change incrementally under the hood when justified. Any of those seems fine to me. FWIW, NCO quantization defaults to GBG now.
GBG yields better CR than BG for small NSD, so its accuracy is correspondingly worse than BG's. However, users essentially pick the desired quantization error when they choose NSD for a variable. Is there any point to giving users two choices (algorithm and NSD) instead of one (NSD)? Maybe, maybe not. Anyway, this is clearly an important topic to seek consensus on before releasing 4.8.2. Thanks for raising these questions.
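
The constraint described above (quantize, but keep the IEEE on-disk format so that no decoder is needed, then let DEFLATE do the compression) can be illustrated with a minimal Python sketch. This is not the GBG algorithm or any netcdf-c code: it uses plain round-to-nearest on float32 mantissa bits, the bits-per-digit rule in `keep_bits_for_nsd` is an assumed rule of thumb, the function names are hypothetical, and NaN, infinity, and fill values are ignored.

```python
import math
import struct
import zlib

MANTISSA_BITS = 23  # explicit mantissa bits in an IEEE 754 float32


def keep_bits_for_nsd(nsd):
    """Mantissa bits to keep for `nsd` significant decimal digits.

    Assumption for illustration: each decimal digit needs log2(10) ~ 3.32 bits.
    """
    return math.ceil(nsd * math.log2(10))


def bit_round(value, keep_bits):
    """Round a value to float32 with only `keep_bits` explicit mantissa bits.

    The result is still an ordinary IEEE float, so readers need no decoder.
    """
    drop = MANTISSA_BITS - keep_bits
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    if drop <= 0:
        return struct.unpack("<f", struct.pack("<I", bits))[0]
    bits += 1 << (drop - 1)                  # add half of the last kept bit
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFF  # zero the dropped trailing bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def deflated_size(values):
    """Size in bytes of the values stored as float32 and DEFLATE-compressed."""
    raw = struct.pack("<%df" % len(values), *values)
    return len(zlib.compress(raw, 6))


if __name__ == "__main__":
    # Synthetic, smooth "temperature-like" data; not the benchmark dataset.
    data = [273.15 + 10.0 * math.sin(0.01 * i) for i in range(10000)]
    keep = keep_bits_for_nsd(3)
    rounded = [bit_round(v, keep) for v in data]
    print("DEFLATE bytes, original:", deflated_size(data))
    print("DEFLATE bytes, rounded: ", deflated_size(rounded))
```

The zeroed trailing mantissa bits are what make the rounded data more compressible; how many bits can safely be zeroed for a given NSD is exactly where the algorithms discussed in this thread differ.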

@WardF
Member

WardF commented Nov 4, 2021

While clearing out the PR backlog, I noticed that the quantize test is giving an 'unexpected error'. The autotools-based test is silent; I will adjust that to provide additional information in the case of failure. The cmake test is at least a little bit more verbose about it.

@czender
Contributor Author

czender commented Nov 4, 2021

Thanks, @WardF. I see why the test fails and it's an easy fix. The test assumes that NC_QUANTIZE_BITGROOM is the last quantize mode defined, and it expects an error when it requests a mode beyond that. Now NC_QUANTIZE_GRANULARBG is the last quantize mode defined, so the test that previously passed should now fail, as observed. I am not sure why this was silently passing in the autoconf-based testing I did earlier, though. I think the simplest fix is to change the test so that it tries to access an undefined quantize mode one greater than NC_QUANTIZE_GRANULARBG, and I will submit a patch for that soon. Another route would be to add another token, e.g., NC_QUANTIZE_MODE_MAX, defined as the greatest valid enumerated value of the quantize modes. I hesitate to do that because my sense is that adding tokens is frowned upon.
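
The failure mode and the two fixes discussed above can be modelled in a few lines of Python. This is a hypothetical model, not the actual C test or library code: the numeric values of the mode constants are assumed for illustration, `set_quantize_mode` and `rejects` are made-up helpers, and NC_QUANTIZE_MODE_MAX is only the token floated in the comment, not something the source says was added.

```python
# Assumed numeric values, for illustration only.
NC_NOQUANTIZE = 0
NC_QUANTIZE_BITGROOM = 1
NC_QUANTIZE_GRANULARBG = 2

# Hypothetical sentinel suggested in the comment above: the greatest valid mode.
NC_QUANTIZE_MODE_MAX = NC_QUANTIZE_GRANULARBG


def set_quantize_mode(mode):
    """Stand-in for the library call: accept only defined quantize modes."""
    if not NC_NOQUANTIZE <= mode <= NC_QUANTIZE_MODE_MAX:
        raise ValueError("invalid quantize mode: %d" % mode)
    return mode


def rejects(mode):
    """True if the stand-in library call refuses the given mode."""
    try:
        set_quantize_mode(mode)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    # Old check: assumed BITGROOM was the last mode, so BITGROOM + 1 had to fail.
    print("old check still holds:", rejects(NC_QUANTIZE_BITGROOM + 1))
    # Simplest fix: probe one past the new last mode.
    print("updated check holds:  ", rejects(NC_QUANTIZE_GRANULARBG + 1))
    # Alternative: probe one past a sentinel, so the test survives new modes.
    print("sentinel check holds: ", rejects(NC_QUANTIZE_MODE_MAX + 1))
```

The sentinel variant avoids editing the test each time a mode is added, at the cost of one more public token, which is the trade-off weighed in the comment.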

…ANULARBG (instead of NC_QUANTIZE_BITGROOM) fails.
@WardF WardF merged commit 3980d76 into Unidata:main Jan 14, 2022
DennisHeimbigner added a commit to DennisHeimbigner/netcdf-c that referenced this pull request Jan 24, 2022
re: PR Unidata#2088
re: PR Unidata#2130
replaces: Unidata#2140

Changes:
* Add NCZarr-specific quantize functions to the dispatch table.
* Copy (modified) quantize code from libhdf5 to NCZarr
* Add quantize invocation to zvar.c
* Add support for _QuantizeBitgroomNumberOfSignificantDigits
and _QuantizeGranularBitgroomNumberOfSignificantDigits to ncgen.
* Modify nc_test4/tst_quantize.c to allow it to be used both for hdf5
  and for nczarr.
* Make dap4 properly handle quantize functions in dispatch table.
* Add quantize attribute support to ncgen.

Other changes:
* Caught and fixed some S3 problems.
* Fixed some nczarr fillvalue problems.
* Fixed some nczarr cache problems.
* Clean up some flaws in libdispatch/dinfermodel.c.
* Allow byterange requests to S3 to be readable by dinfermodel.c/check_file_type.
* Remove the libnczarr ztracedispatch code (big change).