diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md index d0ce10764d..f876461b14 100644 --- a/RELEASE_NOTES.md +++ b/RELEASE_NOTES.md @@ -7,6 +7,7 @@ This file contains a high-level description of this package's evolution. Release ## 4.8.2 - TBD +* [Enhancement] Update the documentation to match the current filter capabilities See [Github #2249](https://github.com/Unidata/netcdf-c/pull/2249). * [Enhancement] Support installation of pre-built standard filters into user-specified location. See [Github #2318](https://github.com/Unidata/netcdf-c/pull/2318). * [Enhancement] Improve filter support. More specifically (1) add nc_inq_filter_avail to check if a filter is available, (2) add the notion of standard filters, (3) cleanup szip support to fix interaction with NCZarr. See [Github #2245](https://github.com/Unidata/netcdf-c/pull/2245). * [Enhancement] Switch to tinyxml2 as the default xml parser implementation. See [Github #2170](https://github.com/Unidata/netcdf-c/pull/2170). diff --git a/docs/Doxyfile.in b/docs/Doxyfile.in index fcb2b33680..54f504401d 100644 --- a/docs/Doxyfile.in +++ b/docs/Doxyfile.in @@ -754,7 +754,8 @@ INPUT = \ @abs_top_srcdir@/docs/COPYRIGHT.md \ @abs_top_srcdir@/docs/credits.md \ @abs_top_srcdir@/docs/tutorial.dox \ - @abs_top_srcdir@/docs/internal.dox \ + @abs_top_srcdir@/docs/internal.md \ + @abs_top_srcdir@/docs/dispatch.md \ @abs_top_srcdir@/docs/inmeminternal.dox \ @abs_top_srcdir@/docs/indexing.dox \ @abs_top_srcdir@/docs/testserver.dox \ diff --git a/docs/FAQ.md b/docs/FAQ.md index 380dfda7e0..511cff0429 100644 --- a/docs/FAQ.md +++ b/docs/FAQ.md @@ -1079,9 +1079,22 @@ and writable by programs that used older versions of the libraries. However, programs linked to older library versions will not be able to create new data objects with the new less-restrictive names. -How difficult is it to convert my application to handle arbitrary netCDF-4 files? 
{#How-difficult-is-it-to-convert-my-application-to-handle-arbitrary-netCDF-4-files} +Can I use UTF-8 File Names with Windows? {#Can-I-use-UTF-8-File-Names-with-Windows} ----------------- +Starting with Windows 10 build 17134, Windows can support use of +the UTF-8 character set. We strongly encourage Windows users to +enable this feature. This requires the following steps. + +1. In the "run" toolbar, execute the command "intl.cpl". +2. Move to the Administrative tab. +3. Move to "Change system locale" +4. Check the box at the bottom labeled something like +"Beta: Use Unicode UTF-8 for worldwide language support" + + +How difficult is it to convert my application to handle arbitrary netCDF-4 files? {#How-difficult-is-it-to-convert-my-application-to-handle-arbitrary-netCDF-4-files} +----------------- Modifying an application to fully support the new enhanced data model may be relatively easy or arbitrarily difficult :-), depending on what diff --git a/docs/Makefile.am b/docs/Makefile.am index 9f851abc9d..321e03727a 100644 --- a/docs/Makefile.am +++ b/docs/Makefile.am @@ -9,7 +9,7 @@ # These files will be included with the dist. 
EXTRA_DIST = netcdf.m4 DoxygenLayout.xml Doxyfile.in footer.html \ mainpage.dox tutorial.dox \ -architecture.dox internal.dox windows-binaries.md \ +architecture.dox internal.md windows-binaries.md dispatch.md \ building-with-cmake.md CMakeLists.txt groups.dox notes.md \ install-fortran.md all-error-codes.md credits.md auth.md filters.md \ obsolete/fan_utils.html indexing.dox \ diff --git a/docs/dispatch.md b/docs/dispatch.md new file mode 100644 index 0000000000..a700e36dd7 --- /dev/null +++ b/docs/dispatch.md @@ -0,0 +1,507 @@ +Internal Dispatch Table Architecture +============================ + + +# Internal Dispatch Table Architecture + +\tableofcontents + +# Introduction {#dispatch_intro} + +The netcdf-c library uses an internal dispatch mechanism +as the means for wrapping the netcdf-c API around a wide variety +of underlying storage and stream data formats. +As of last check, the following formats are supported and each +has its own dispatch table. + +Warning: some of the listed function signatures may be out of date +and the specific code should be consulted to see the actual parameters. + + +
| Format | Directory | NC_FORMATX Name |
| ------ | --------- | --------------- |
| NetCDF-classic | libsrc | NC_FORMATX_NC3 |
| NetCDF-enhanced | libhdf5 | NC_FORMATX_NC_HDF5 |
| HDF4 | libhdf4 | NC_FORMATX_NC_HDF4 |
| PNetCDF | libsrcp | NC_FORMATX_PNETCDF |
| DAP2 | libdap2 | NC_FORMATX_DAP2 |
| DAP4 | libdap4 | NC_FORMATX_DAP4 |
| UDF0 | N.A. | NC_FORMATX_UDF0 |
| UDF1 | N.A. | NC_FORMATX_UDF1 |
| NCZarr | libnczarr | NC_FORMATX_NCZARR |
Note that UDF0 and UDF1 allow for user-defined dispatch tables to be implemented.

The idea is that when a user opens or creates a netcdf file, a specific dispatch table is chosen. A dispatch table is a struct containing an entry for (almost) every function in the netcdf-c API. During execution, netcdf API calls are channeled through that dispatch table to the appropriate function for implementing that API call. The functions in the dispatch table are not quite the same as those defined in *netcdf.h*. For simplicity and compactness, some netcdf.h API calls are mapped to the same dispatch table function. In addition to the functions, the first entry in the table defines the model that this dispatch table implements. It will be one of the NC_FORMATX_XXX values. The second entry in the table is the version of the dispatch table. The rule is that previous entries may not be removed, but new entries may be added, and adding new entries increases the version number.

The dispatch table represents a distillation of the netcdf API down to a minimal set of internal operations. The format of the dispatch table is defined in the file *libdispatch/ncdispatch.h*. Every new dispatch table must define this minimal set of operations.

# Adding a New Dispatch Table

In order to make this process concrete, let us assume we plan to add an in-memory implementation of netcdf-3.

## Defining configure.ac flags

Define a *--enable* flag option for *configure.ac*. For our example, we assume the option "--enable-ncm" and the internal corresponding flag "enable_ncm". If you examine the existing *configure.ac* and see how, for example, *--enable-dap2* is defined, then it should be clear how to do it for your code.

## Defining a "name space"

Choose some prefix of characters to identify the new dispatch system. In effect we are defining a name-space. For our in-memory system, we will choose "NCM" and "ncm".
NCM is used for non-static procedures to be entered into the dispatch table and ncm for all other non-static procedures. Note that the chosen prefix should probably start with "nc" or "NC" in order to avoid name conflicts outside the netcdf-c library.

## Extend include/netcdf.h

Modify the file *include/netcdf.h* to add an NC_FORMATX_XXX flag by adding a flag for this dispatch format at the appropriate places.
````
    #define NC_FORMATX_NCM 7
````

Add any format specific new error codes.
````
#define NC_ENCM (?)
````

## Extend include/ncdispatch.h

Modify the file *include/ncdispatch.h* to add format specific data and initialization functions; note the use of our NCM namespace.
````
    #ifdef ENABLE_NCM
    extern NC_Dispatch* NCM_dispatch_table;
    extern int NCM_initialize(void);
    #endif
````

## Define the dispatch table functions

Define the functions necessary to fill in the dispatch table. As a rule, we assume that a new directory is defined, *libsrcm*, say. Within this directory, we need to define *Makefile.am* and *CMakeLists.txt*. We also need to define the source files containing the dispatch table and the functions to be placed in the dispatch table -- call them *ncmdispatch.c* and *ncmdispatch.h*. Look at *libsrc/nc3dispatch.[ch]* or *libnczarr/zdispatch.[ch]* for examples.

Similarly, it is best to take existing *Makefile.am* and *CMakeLists.txt* files (from *libsrcp* for example) and modify them.

## Adding the dispatch code to libnetcdf

Provide for the inclusion of this library in the final libnetcdf library. This is accomplished by modifying *liblib/Makefile.am* by adding something like the following.
````
    if ENABLE_NCM
    libnetcdf_la_LIBADD += $(top_builddir)/libsrcm/libnetcdfm.la
    endif
````

## Extend library initialization

Modify the *NC_initialize* function in *liblib/nc_initialize.c* by adding appropriate references to the NCM dispatch function.
+```` + #ifdef ENABLE_NCM + extern int NCM_initialize(void); + #endif + ... + int NC_initialize(void) + { + ... + #ifdef ENABLE_NCM + if((stat = NCM_initialize())) return stat; + #endif + ... + } +```` + +Finalization is handled in an analogous fashion. + +## Testing the new dispatch table + +Add a directory of tests: *ncm_test*, say. The file *ncm_test/Makefile.am* +will look something like this. +```` + # These files are created by the tests. + CLEANFILES = ... + # These are the tests which are always run. + TESTPROGRAMS = test1 test2 ... + test1_SOURCES = test1.c ... + ... + # Set up the tests. + check_PROGRAMS = $(TESTPROGRAMS) + TESTS = $(TESTPROGRAMS) + # Any extra files required by the tests + EXTRA_DIST = ... +```` + +# Top-Level build of the dispatch code + +Provide for *libnetcdfm* to be constructed by adding the following to +the top-level *Makefile.am*. + +```` + if ENABLE_NCM + NCM=libsrcm + NCMTESTDIR=ncm_test + endif + ... + SUBDIRS = ... $(DISPATCHDIR) $(NCM) ... $(NCMTESTDIR) +```` + +# Choosing a Dispatch Table + +The dispatch table is ultimately chosen by the function +NC_infermodel() in libdispatch/dinfermodel.c. This function is +invoked by the NC_create and the NC_open procedures. This can +be, unfortunately, a complex process. The detailed operation of +NC_infermodel() is defined in the companion document in docs/dinternal.md. + +In any case, the choice of dispatch table is currently based on the following +pieces of information. + +1. The mode argument – this can be used to detect, for example, what kind +of file to create: netcdf-3, netcdf-4, 64-bit netcdf-3, etc. +Using a mode flag is the most common mechanism, in which case +*netcdf.h* needs to be modified to define the relevant mode flag. + +2. The file path – this can be used to detect, for example, a DAP url +versus a normal file system file. If the path looks like a URL, then +the fragment part of the URL is examined to determine the specific +dispatch function. + +3. 
The file contents - when the contents of a real file are available, the contents of the file can be used to determine the dispatch table. As a rule, this is likely to be useful only for *nc_open*.

4. If the file is being opened vs being created.

5. Is parallel IO available?

The *NC_infermodel* function returns two values.

1. model - this is used by nc_open and nc_create to choose the dispatch table.
2. newpath - in some cases, usually URLs, the path may be rewritten to include extra information for use by the dispatch functions.

# Special Dispatch Table Signatures

The entries in the dispatch table do not necessarily correspond to the external API. In many cases, multiple related API functions are merged into a single dispatch table entry.

## Create/Open

The create table entry and the open table entry in the dispatch table have the following signatures respectively.
````
    int (*create)(const char *path, int cmode,
              size_t initialsz, int basepe, size_t *chunksizehintp,
              int useparallel, void* parameters,
              struct NC_Dispatch* table, NC* ncp);

    int (*open)(const char *path, int mode,
              int basepe, size_t *chunksizehintp,
              int use_parallel, void* parameters,
              struct NC_Dispatch* table, NC* ncp);
````

The key difference is that these are the union of all the possible create/open signatures from the include/netcdfXXX.h files. Note especially the last three parameters. The parameters argument is a pointer to arbitrary data to provide extra info to the dispatcher. The table argument is included in case the create function (e.g. *NCM_create*) needs to invoke other dispatch functions. The very last argument, ncp, is a pointer to an NC instance. The raw NC instance will have been created by *libdispatch/dfile.c* and is passed to e.g. open with the expectation that it will be filled in by the dispatch open function.
## Accessing Data with put_vara() and get_vara()

````
    int (*put_vara)(int ncid, int varid, const size_t *start, const size_t *count,
                    const void *value, nc_type memtype);
````

````
    int (*get_vara)(int ncid, int varid, const size_t *start, const size_t *count,
                    void *value, nc_type memtype);
````

Most of the parameters are similar to the netcdf API parameters. The last parameter, however, is the type of the data in memory. Additionally, instead of using an "int islong" parameter, the memtype will be either ::NC_INT or ::NC_INT64, depending on the value of sizeof(long). This means that even netcdf-3 code must be prepared to encounter the ::NC_INT64 type.

## Accessing Attributes with put_att() and get_att()

````
    int (*get_att)(int ncid, int varid, const char *name,
                   void *value, nc_type memtype);
````

````
    int (*put_att)(int ncid, int varid, const char *name, nc_type datatype, size_t len,
                   const void *value, nc_type memtype);
````

Again, the key difference is the memtype parameter. As with put/get_vara, it uses ::NC_INT64 to encode the long case.

## Pre-defined Dispatch Functions

It is sometimes not necessary to implement all the functions in the dispatch table. Some pre-defined functions are available which may be used in many cases.

## Inquiry Functions

Many of the netCDF inquiry functions operate from an in-memory model of metadata. Once a file is opened or created, this in-memory metadata model is kept up to date. Consequently the inquiry functions do not depend on the dispatch layer code. These functions can be used by all dispatch layers which use the internal netCDF enhanced data model.
- NC4_inq
- NC4_inq_type
- NC4_inq_dimid
- NC4_inq_dim
- NC4_inq_unlimdim
- NC4_inq_att
- NC4_inq_attid
- NC4_inq_attname
- NC4_get_att
- NC4_inq_varid
- NC4_inq_var_all
- NC4_show_metadata
- NC4_inq_unlimdims
- NC4_inq_ncid
- NC4_inq_grps
- NC4_inq_grpname
- NC4_inq_grpname_full
- NC4_inq_grp_parent
- NC4_inq_grp_full_ncid
- NC4_inq_varids
- NC4_inq_dimids
- NC4_inq_typeids
- NC4_inq_type_equal
- NC4_inq_user_type
- NC4_inq_typeid

## NCDEFAULT get/put Functions

The mapped (varm) get/put functions have been implemented in terms of the array (vara) functions. So dispatch layers need only implement the vara functions, and can use the following functions to obtain the varm functions:

- NCDEFAULT_get_varm
- NCDEFAULT_put_varm

For the netcdf-3 format, the strided functions (nc_get/put_vars) are similarly implemented in terms of the vara functions. So the following convenience functions are available.

- NCDEFAULT_get_vars
- NCDEFAULT_put_vars

For the netcdf-4 format, the vars functions actually exist, so the default vars functions are not used.

## Read-Only Functions

Some dispatch layers are read-only (e.g. HDF4). Any function which writes to a file, including nc_create(), needs to return error code ::NC_EPERM. The following read-only functions are available so that these don't have to be re-implemented in each read-only dispatch layer:

- NC_RO_create
- NC_RO_redef
- NC_RO__enddef
- NC_RO_sync
- NC_RO_set_fill
- NC_RO_def_dim
- NC_RO_rename_dim
- NC_RO_rename_att
- NC_RO_del_att
- NC_RO_put_att
- NC_RO_def_var
- NC_RO_rename_var
- NC_RO_put_vara
- NC_RO_def_var_fill

## Classic NetCDF Only Functions

There are two functions that are only used in the classic code. All other dispatch layers (except PnetCDF) return error ::NC_ENOTNC3 for these functions.
The following functions are provided for this purpose:

- NOTNC3_inq_base_pe
- NOTNC3_set_base_pe

# HDF4 Dispatch Layer as a Simple Example

The HDF4 dispatch layer is about the simplest possible dispatch layer. It is read-only and uses the classic model. It will serve as a nice, simple example of a dispatch layer.

Note that the HDF4 layer is optional in the netCDF build. Not all users will have HDF4 installed, and those users will not build with the HDF4 dispatch layer enabled. For this reason HDF4 code is guarded as follows.
````
#ifdef USE_HDF4
...
#endif /*USE_HDF4*/
````

Code in libhdf4 is only compiled if HDF4 is turned on in the build.

### The netcdf.h File

In the main netcdf.h file, we have the following:

````
#define NC_FORMATX_NC_HDF4 (3)
````

### The ncdispatch.h File

In ncdispatch.h we have the following:

````
#ifdef USE_HDF4
extern NC_Dispatch* HDF4_dispatch_table;
extern int HDF4_initialize(void);
extern int HDF4_finalize(void);
#endif
````

### The netcdf_meta.h File

The netcdf_meta.h file allows for easy determination of what features are in use. For HDF4, it contains the following, set by configure:
````
...
#define NC_HAS_HDF4 0 /*!< HDF4 support. */
...
````

### The hdf4dispatch.h File

The file *hdf4dispatch.h* contains prototypes and macro definitions used within the HDF4 code in libhdf4. This include file should not be used anywhere except in libhdf4.

### Initialization Code Changes in liblib Directory

The file *nc_initialize.c* is modified to include the following:
````
#ifdef USE_HDF4
extern int HDF4_initialize(void);
extern int HDF4_finalize(void);
#endif
````

### Changes to libdispatch/dfile.c

In order for a dispatch layer to be used, it must be correctly determined in functions *NC_open()* or *NC_create()* in *libdispatch/dfile.c*.
HDF4 has a magic number that is detected in *NC_interpret_magic_number()*, which allows *NC_open* to automatically detect an HDF4 file.

Once HDF4 is detected, the *model* variable is set to *NC_FORMATX_NC_HDF4*, and later this is used in a case statement:
````
    case NC_FORMATX_NC_HDF4:
        dispatcher = HDF4_dispatch_table;
        break;
````

This sets the dispatcher to the HDF4 dispatcher, which is defined in the libhdf4 directory.

### Dispatch Table in libhdf4/hdf4dispatch.c

The file *hdf4dispatch.c* contains the definition of the HDF4 dispatch table. It looks like this:
````
/* This is the dispatch object that holds pointers to all the
 * functions that make up the HDF4 dispatch interface. */
static NC_Dispatch HDF4_dispatcher = {
NC_FORMATX_NC_HDF4,
NC_DISPATCH_VERSION,
NC_RO_create,
NC_HDF4_open,
NC_RO_redef,
NC_RO__enddef,
NC_RO_sync,
...
NC_NOTNC4_set_var_chunk_cache,
NC_NOTNC4_get_var_chunk_cache,
...
};
````
Note that most functions use some of the predefined dispatch functions. Functions that start with NC_RO* are read-only; they return ::NC_EPERM. Functions that start with NC_NOTNC4* return ::NC_ENOTNC4.

Only the functions that start with NC_HDF4* need to be implemented for the HDF4 dispatch layer. There are 6 such functions:

- NC_HDF4_open
- NC_HDF4_abort
- NC_HDF4_close
- NC_HDF4_inq_format
- NC_HDF4_inq_format_extended
- NC_HDF4_get_vara

### HDF4 Reading Code

The code in *hdf4file.c* opens the HDF4 SD dataset and reads the metadata. This metadata is stored in the netCDF internal metadata model, allowing the inq functions to work.

The code in *hdf4var.c* does an *nc_get_vara()* on the HDF4 SD dataset. This is all that is needed for all the nc_get_* functions to work.

# Point of Contact {#dispatch_poc}

*Author*: Dennis Heimbigner
+*Email*: dmh at ucar dot edu
+*Initial Version*: 12/22/2021
+*Last Revised*: 12/22/2021 diff --git a/docs/filters.md b/docs/filters.md index 00dbe59304..26cffce152 100644 --- a/docs/filters.md +++ b/docs/filters.md @@ -10,188 +10,212 @@ NetCDF-4 Filter Support {#filters} # Introduction to Filters {#filters_introduction} -The netCDF library supports a general filter mechanism to apply various -kinds of filters to datasets before reading or writing. - -The netCDF enhanced (aka netCDF-4) library inherits this capability since it depends on the HDF5 library. -The HDF5 library (1.8.11 and later) supports filters, and netCDF is based closely on that underlying HDF5 mechanism. - -Filters assume that a variable has chunking defined and each chunk is filtered before writing and "unfiltered" after reading and before passing the data to the user. - -In the event that multiple filters are defined on a variable, they are applied in first-defined order on writing and on the reverse order when reading. - -The most common kind of filter is a compression-decompression filter, and that is the focus of this document. - -For now, this document is strongly influenced by the HDF5 mechanism. -When other implementations (e.g. Zarr) support filters, this document will have multiple sections: one for each mechanism. +The netCDF library supports a general filter mechanism to apply +various kinds of filters to datasets before reading or writing. +The most common kind of filter is a compression-decompression +filter, and that is the focus of this document. +But non-compression filters – fletcher32, for example – also exist. + +The netCDF enhanced (aka netCDF-4) library inherits this +capability since it depends on the HDF5 library. The HDF5 +library (1.8.11 and later) supports filters, and netCDF is based +closely on that underlying HDF5 mechanism. + +Filters assume that a variable has chunking defined and each +chunk is filtered before writing and "unfiltered" after reading +and before passing the data to the user. 
In the event that +multiple filters are defined on a variable, they are applied in +first-defined order on writing and on the reverse order when +reading. + +This document describes the support for HDF5 filters and also +the newly added support for NCZarr filters. # A Warning on Backward Compatibility {#filters_compatibility} -The API defined in this document should accurately reflect -the current state of filters in the netCDF-c library. -Be aware that there was a short period in which the filter code was undergoing some revision and extension. -Those extensions have largely been reverted. -Unfortunately, some users may experience some compilation problems for previously working code because of these reversions. -In that case, please revise your code to adhere to this document. Apologies are extended for any inconvenience. +The API defined in this document should accurately reflect the +current state of filters in the netCDF-c library. Be aware that +there was a short period in which the filter code was undergoing +some revision and extension. Those extensions have largely been +reverted. Unfortunately, some users may experience some +compilation problems for previously working code because of +these reversions. In that case, please revise your code to +adhere to this document. Apologies are extended for any +inconvenience. A user may encounter an incompatibility if any of the following appears in user code. -* The function _nc_inq_var_filter_ was returning the error value _NC_ENOFILTER_ if a variable had no associated filters. - It has been reverted to the previous case where it returned _NC_NOERR_ and the returned filter id was set to zero if the variable had no filters. -* The function _nc_inq_var_filterids_ was renamed to _nc_inq_var_filter_ids_. -* Some auxilliary functions for parsing textual filter specifications have been moved to __netcdf_aux.h__. - See Appendix A. 
+* The function *nc\_inq\_var\_filter* was returning the error value NC\_ENOFILTER if a variable had no associated filters. + It has been reverted to the previous case where it returns NC\_NOERR and the returned filter id is set to zero if the variable had no filters. +* The function *nc\_inq\_var\_filterids* was renamed to *nc\_inq\_var\_filter\_ids*. +* Some auxiliary functions for parsing textual filter specifications have been moved to the file *netcdf\_aux.h*. See Appendix A. * All of the "filterx" functions have been removed. This is unlikely to cause problems because they had limited visibility. -* The undocumented function "nc_filter_remove" no longer exists. For additional information, see Appendix B. # Enabling A HDF5 Compression Filter {#filters_enable} -HDF5 supports dynamic loading of compression filters using the following process for reading of compressed data. +HDF5 supports dynamic loading of compression filters using the +following process for reading of compressed data. 1. Assume that we have a dataset with one or more variables that were compressed using some algorithm. How the dataset was compressed will be discussed subsequently. 2. Shared libraries or DLLs exist that implement the compress/decompress algorithm. These libraries have a specific API so that the HDF5 library can locate, load, and utilize the compressor. - These libraries are expected to installed in a specific directory. +3. These libraries are expected to be installed in a specific directory. -In order to compress a variable with an HDF5 compliant filter, the netcdf-c library must be given three pieces of information: 1. some unique identifier for the filter to be used, 2. a vector of parameters for controlling the action of the compression filter, and -3. a shared library implementation of the filter.
- -The meaning of the parameters is, of course, completely filter dependent and the filter description [3] needs to be consulted. -For bzip2, for example, a single parameter is provided representing the compression level. -It is legal to provide a zero-length set of parameters. -Defaults are not provided, so this assumes that the filter can operate with zero parameters. - -Filter ids are assigned by the HDF group. -See [4] for a current list of assigned filter ids. -Note that ids above 32767 can be used for testing without registration. - -The first two pieces of information can be provided in one of three ways: using __ncgen__, via an API call, or via command line parameters to __nccopy__. -In any case, remember that filtering also requires setting chunking, so the variable must also be marked with chunking information. -If compression is set for a non-chunked variable, the variable will forcibly be +3. access to a shared library implementation of the filter. + +The meaning of the parameters is, of course, completely filter +dependent and the filter description [3] needs to be consulted. +For bzip2, for example, a single parameter is provided +representing the compression level. It is legal to provide a +zero-length set of parameters. Defaults are not provided, so +this assumes that the filter can operate with zero parameters. + +Filter ids are assigned by the HDF group. See [4] for a current +list of assigned filter ids. Note that ids above 32767 can be +used for testing without registration. + +The first two pieces of information can be provided in one of +three ways: (1) using *ncgen*, (2) via an API call, or (3) via +command line parameters to *nccopy*. In any case, remember that +filtering also requires setting chunking, so the variable must +also be marked with chunking information. If compression is set +for a non-chunked variable, the variable will forcibly be converted to chunked using a default chunking algorithm. 
## Using The API {#filters_API} -The necessary API methods are included in _netcdf_filter.h_ by default. +The necessary API methods are included in *netcdf\_filter.h* by default. These functions implicitly use the HDF5 mechanisms and may produce an error if applied to a file format that is not compatible with the HDF5 mechanism. -1. Add a filter to the set of filters to be used when writing a variable. - - This must be invoked after the variable has been created and before __nc_enddef__ is invoked. +### nc\_def\_var\_filter +Add a filter to the set of filters to be used when writing a variable. This must be invoked after the variable has been created and before *nc\_enddef* is invoked. +```` + int nc_def_var_filter(int ncid, int varid, unsigned int id, + size_t nparams, const unsigned int* params); ```` -int nc_def_var_filter(int ncid, int varid, unsigned int id, size_t nparams, const unsigned int* params); - Arguments: -* ncid -- File and group ID. -* varid -- Variable ID. -* id -- Filter ID. -* nparams -- Number of filter parameters. -* params -- Filter parameters. + +* ncid — File and group ID. +* varid — Variable ID. +* id — Filter ID. +* nparams — Number of filter parameters. +* params — Filter parameters (a vector of unsigned integers) Return codes: -* NC_NOERR -- No error. -* NC_ENOTNC4 -- Not a netCDF-4 file. -* NC_EBADID -- Bad ncid or bad filter id -* NC_ENOTVAR -- Invalid variable ID. -* NC_EINDEFINE -- called when not in define mode -* NC_ELATEDEF -- called after variable was created -* NC_EINVAL -- Scalar variable, or parallel enabled and parallel filters not supported or nparams or params invalid. -```` -2. Query a variable to obtain a list of all filters associated with that variable. +* NC\_NOERR — No error. +* NC\_ENOTNC4 — Not a netCDF-4 file. +* NC\_EBADID — Bad ncid or bad filter id +* NC\_ENOTVAR — Invalid variable ID. 
+* NC\_EINDEFINE — called when not in define mode +* NC\_ELATEDEF — called after variable was created +* NC\_EINVAL — Scalar variable, or parallel enabled and parallel filters not supported or nparams or params invalid. - The number of filters associated with the variable is stored in __nfiltersp__ (it may be zero). - The set of filter ids will be returned in __filterids__. - As is usual with the netcdf API, one is expected to call this function twice. - The first time to set __nfiltersp__ and the second to get the filter ids in client-allocated memory. - Any of these arguments can be NULL, in which case no value is returned. +### nc\_inq\_var\_filter\_ids +Query a variable to obtain a list of the ids of all filters associated with that variable. ```` int nc_inq_var_filter_ids(int ncid, int varid, size_t* nfiltersp, unsigned int* filterids); - +```` Arguments: -* ncid -- File and group ID. -* varid -- Variable ID. -* nfiltersp -- Stores number of filters found; may be zero. -* filterids -- Stores set of filter ids. + +* ncid — File and group ID. +* varid — Variable ID. +* nfiltersp — Stores number of filters found; may be zero. +* filterids — Stores set of filter ids. Return codes: -* NC_NOERR -- No error. -* NC_ENOTNC4 -- Not a netCDF-4 file. -* NC_EBADID -- Bad ncid -* NC_ENOTVAR -- Invalid variable ID. -```` -3. Query a variable to obtain information about a specific filter associated with the variable. +* NC\_NOERR — No error. +* NC\_ENOTNC4 — Not a netCDF-4 file. +* NC\_EBADID — Bad ncid +* NC\_ENOTVAR — Invalid variable ID. + +The number of filters associated with the variable is stored in *nfiltersp* (it may be zero). +The set of filter ids will be returned in *filterids*. +As is usual with the netcdf API, one is expected to call this function twice. +The first time to set *nfiltersp* and the second to get the filter ids in client-allocated memory. +Any of these arguments can be NULL, in which case no value is returned. 
- The __id__ indicates the filter of interest. - The actual parameters are stored in __params__. - The number of parameters is returned in __nparamsp__. - As is usual with the netcdf API, one is expected to call this function twice. - The first time to set __nparamsp__ and the second to get the parameters in client-allocated memory. - Any of these arguments can be NULL, in which case no value is returned. - If the specified id is not attached to the variable, then NC_ENOFILTER is returned. +### nc\_inq\_var\_filter\_info +Query a variable to obtain information about a specific filter associated with the variable. ```` int nc_inq_var_filter_info(int ncid, int varid, unsigned int id, size_t* nparamsp, unsigned int* params); - +```` Arguments: -* ncid -- File and group ID. -* varid -- Variable ID. -* id -- The filter id of interest. -* nparamsp -- Stores number of parameters. -* params -- Stores set of filter parameters. -Return codes: -* NC_NOERR -- No error. -* NC_ENOTNC4 -- Not a netCDF-4 file. -* NC_EBADID -- Bad ncid -* NC_ENOTVAR -- Invalid variable ID. -* NC_ENOFILTER -- Filter not defined for the variable. -```` +* ncid — File and group ID. +* varid — Variable ID. +* id — The filter id of interest. +* nparamsp — Stores number of parameters. +* params — Stores set of filter parameters. -4. Query a variable to obtain information about the first filter associated with the variable. +Return codes: - When netcdf-c was modified to support multiple filters per variable, the utility of this function became redundant since it returns info only about the first defined filter for the variable. - Internally, it is implemented using the functions __nc_inq_var_filter_ids__ and __nc_inq_filter_info__. +* NC\_NOERR — No error. +* NC\_ENOTNC4 — Not a netCDF-4 file. +* NC\_EBADID — Bad ncid +* NC\_ENOTVAR — Invalid variable ID. +* NC\_ENOFILTER — Filter not defined for the variable. + +The *id* indicates the filter of interest. +The actual parameters are stored in *params*. 
+The number of parameters is returned in *nparamsp*.
+As is usual with the netcdf API, one is expected to call this function twice.
+The first time to set *nparamsp* and the second to get the parameters in client-allocated memory.
+Any of these arguments can be NULL, in which case no value is returned.
+If the specified id is not attached to the variable, then NC\_ENOFILTER is returned.
+
+### nc\_inq\_var\_filter
+Query a variable to obtain information about the first filter associated with the variable.
+When netcdf-c was modified to support multiple filters per variable, this function became largely redundant since it returns info only about the first defined filter for the variable.
+Internally, it is implemented using the functions *nc\_inq\_var\_filter\_ids* and *nc\_inq\_var\_filter\_info*.

-   In any case, the filter id will be returned in the __idp__ argument.
-   If there are not filters, then zero is stored in this argument.
-   Otherwise, the number of parameters is stored in __nparamsp__ and the actual parameters in __params__.
-   As is usual with the netcdf API, one is expected to call this function twice.
-   The first time to get __nparamsp__ and the second to get the parameters in client-allocated memory.
-   Any of these arguments can be NULL, in which case no value is returned.
````
int nc_inq_var_filter(int ncid, int varid, unsigned int* idp, size_t* nparamsp, unsigned int* params);
+````

Arguments:
-* ncid -- File and group ID.
-* varid -- Variable ID.
-* idp -- Stores the id of the first found filter, set to zero if variable has no filters.
-* nparamsp -- Stores number of parameters.
-* params -- Stores set of filter parameters.
+
+* ncid — File and group ID.
+* varid — Variable ID.
+* idp — Stores the id of the first found filter, set to zero if variable has no filters.
+* nparamsp — Stores number of parameters.
+* params — Stores set of filter parameters.

Return codes:
-* NC_NOERR -- No error.
-* NC_ENOTNC4 -- Not a netCDF-4 file.
-* NC_EBADID -- Bad ncid -* NC_ENOTVAR -- Invalid variable ID. -```` + +* NC\_NOERR — No error. +* NC\_ENOTNC4 — Not a netCDF-4 file. +* NC\_EBADID — Bad ncid +* NC\_ENOTVAR — Invalid variable ID. + +The filter id will be returned in the *idp* argument. +If there are no filters, then zero is stored in this argument. +Otherwise, the number of parameters is stored in *nparamsp* and the actual parameters in *params*. +As is usual with the netcdf API, one is expected to call this function twice. +The first time to get *nparamsp* and the second to get the parameters in client-allocated memory. +Any of these arguments can be NULL, in which case no value is returned. ## Using ncgen {#filters_NCGEN} In a CDL file, compression of a variable can be specified by annotating it with the following attribute: -* ''_Filter'' — a string containing a comma separated list of constants specifying (1) the filter id to apply, and (2) a vector of constants representing the parameters for controlling the operation of the specified filter. +* *\_Filter* — a string containing a comma separated list of constants specifying (1) the filter id to apply, and (2) a vector of constants representing the parameters for controlling the operation of the specified filter. See the section on the parameter encoding syntax for the details on the allowable kinds of constants. -This is a "special" attribute, which means that it will normally be invisible when using __ncdump__ unless the -s flag is specified. +This is a "special" attribute, which means that it will normally be invisible when using *ncdump* unless the -s flag is specified. + +For backward compatibility it is probably better to use the *\_Deflate* attribute instead of *\_Filter*. But using *\_Filter* to specify deflation will work. -This attribute may be repeated to specify multiple filters. -For backward compatibility it is probably better to use the ''_Deflate'' attribute instead of ''_Filter''. 
But using ''_Filter'' to specify deflation will work.
+Multiple filters can be specified for a given variable by using the "|" separator.
+Alternatively, this attribute may be repeated to specify multiple filters.

Note that the lexical order of declaration is important when more than one filter is specified for a variable because it determines the order in which the filters are applied.

@@ -215,45 +239,44 @@ Note that the assigned filter id for bzip2 is 307 and for szip it is 4.

## Using nccopy {#filters_NCCOPY}

-When copying a netcdf file using __nccopy__ it is possible to specify filter information for any output variable by using the "-F" option on the command line; for example:
-````
-nccopy -F "var,307,9" unfiltered.nc filtered.nc
-````
-Assume that _unfiltered.nc_ has a chunked but not bzip2 compressed variable named "var".
-This command will copy that variable to the _filtered.nc_ output file but using filter with id 307 (i.e. bzip2) and with parameter(s) 9 indicating the compression level.
+When copying a netcdf file using *nccopy*, it is possible to specify filter information for any output variable by using the "-F" option on the command line; for example:
+
+    nccopy -F "var,307,9" unfiltered.nc filtered.nc
+
+Assume that *unfiltered.nc* has a variable named "var" that is chunked but not bzip2 compressed.
+This command will copy that variable to the *filtered.nc* output file but using the filter with id 307 (i.e., bzip2) with parameter 9 indicating the compression level.
See the section on the parameter encoding syntax for the details on the allowable kinds of constants.

The "-F" option can be used repeatedly, as long as a different variable is specified for each occurrence.

It can be convenient to specify that the same compression is to be applied to more than one variable.
To support this, two additional *-F* cases are defined.

-1. ````-F *,...```` means apply the filter to all variables in the dataset.
-2. 
````-F v1&v2&..,...```` means apply the filter to multiple variables. +1. *-F \*,...* means apply the filter to all variables in the dataset. +2. *-F v1&v2&..,...* means apply the filter to multiple variables. Multiple filters can be specified using the pipeline notions '|'. For example -1. ````-F v1&v2,307,9|4,32,32```` means apply filter 307 (bzip2) then filter 4 (szip) to the multiple variables. +1. *-F v1&v2,307,9|4,32,32* means apply filter 307 (bzip2) then filter 4 (szip) to the multiple variables. -Note that the characters '*', '&', and '|' are shell reserved characters, so you will probably need to escape or quote the filter spec in that environment. +Note that the characters '\*', '\&', and '\|' are shell reserved characters, so you will probably need to escape or quote the filter spec in that environment. As a rule, any input filter on an input variable will be applied to the equivalent output variable — assuming the output file type is netcdf-4. It is, however, sometimes convenient to suppress output compression either totally or on a per-variable basis. Total suppression of output filters can be accomplished by specifying a special case of "-F", namely this. -```` -nccopy -F none input.nc output.nc -```` -The expression ````-F *,none```` is equivalent to ````-F none````. + + nccopy -F none input.nc output.nc + +The expression *-F \*,none* is equivalent to *-F none*. Suppression of output filtering for a specific set of variables can be accomplished using these formats. -```` -nccopy -F "var,none" input.nc output.nc -nccopy -F "v1&v2&...,none" input.nc output.nc -```` + + nccopy -F "var,none" input.nc output.nc + nccopy -F "v1&v2&...,none" input.nc output.nc + where "var" and the "vi" are the fully qualified name of a variable. The rules for all possible cases of the "-F none" flag are defined by this table. -
-F none-Fvar,...Input FilterApplied Output Filter
trueundefinedNAunfiltered @@ -261,29 +284,29 @@ The rules for all possible cases of the "-F none" flag are defined by this table
truedefinedNAuse output filter(s)
falseundefineddefineduse input filter(s)
falsenoneNAunfiltered -
falsedefinedNAuse output filter(s) +
falsedefinedundefineduse output filter(s)
falseundefinedundefinedunfiltered
falsedefineddefineduse output filter(s)
# Filter Specification Syntax {#filters_syntax} -The utilities ncgen and nccopy, and also the output of __ncdump__, support the specification of filter ids, formats, and parameters in text format. -The BNF specification is defined in Appendix C. +The utilities ncgen and nccopy, and also the output of *ncdump*, support the specification of filter ids, formats, and parameters in text format. +The BNF specification is defined in Appendix C. Basically, These specifications consist of a filter id, a comma, and then a sequence of comma separated constants representing the parameters. The constants are converted within the utility to a proper set of unsigned int constants (see the parameter encoding section). To simplify things, various kinds of constants can be specified rather than just simple unsigned integers. -The __ncgen__ and __nccopy__ programs will encode them properly using the rules specified in the section on parameter encode/decode. -Since the original types are lost after encoding, __ncdump__ will always show a simple list of unsigned integer constants. +The *ncgen* and *nccopy* programs will encode them properly using the rules specified in the section on parameter encode/decode. +Since the original types are lost after encoding, *ncdump* will always show a simple list of unsigned integer constants. The currently supported constants are as follows.
ExampleTypeFormat TagNotes -
-17bsigned 8-bit byteb|BTruncated to 8 bits and zero extended to 32 bits +
-17bsigned 8-bit byteb|BTruncated to 8 bits and sign extended to 32 bits
23ubunsigned 8-bit byteu|U b|BTruncated to 8 bits and zero extended to 32 bits -
-25Ssigned 16-bit shorts|STruncated to 16 bits and zero extended to 32 bits +
-25Ssigned 16-bit shorts|STruncated to 16 bits and sign extended to 32 bits
27USunsigned 16-bit shortu|U s|STruncated to 16 bits and zero extended to 32 bits
-77implicit signed 32-bit integerLeading minus sign and no tag
77implicit unsigned 32-bit integerNo tag @@ -299,7 +322,7 @@ Some things to note. 2. For an untagged positive integer, the constant is treated as of the smallest type into which it fits (i.e. 8,16,32, or 64 bit). 3. For signed byte and short, the value is sign extended to 32 bits and then treated as an unsigned int value, but maintaining the bit-pattern. 4. For double, and signed|unsigned long long, they are converted as specified in the section on parameter encode/decode. -5. In order to support mutiple filters, the argument to ''_Filter'' may be a pipeline separated (using '|') to specify a list of filters specs. +5. In order to support mutiple filters, the argument to *\_Filter* may be a pipeline separated (using '|') to specify a list of filters specs. # Dynamic Loading Process {#filters_Process} @@ -314,7 +337,7 @@ The default directory is: * "/usr/local/hdf5/lib/plugin” for linux/unix operating systems (including Cygwin) * “%ALLUSERSPROFILE%\\hdf5\\lib\\plugin” for Windows systems, although the code does not appear to explicitly use this path. -The default may be overridden using the environment variable __HDF5_PLUGIN_PATH__. +The default may be overridden using the environment variable *HDF5\_PLUGIN\_PATH*. ## Plugin Library Naming {#filters_Pluginlib} @@ -332,10 +355,10 @@ Given a plugin directory, HDF5 examines every file in that directory that confor For each dynamic library located using the previous patterns, HDF5 attempts to load the library and attempts to obtain information from it. Specifically, It looks for two functions with the following signatures. -1. __H5PL_type_t H5PLget_plugin_type(void)__ — This function is expected to return the constant value __H5PL_TYPE_FILTER__ to indicate that this is a filter library. -2. __const void* H5PLget_plugin_info(void)__ — This function returns a pointer to a table of type __H5Z_class2_t__. +1. 
*H5PL\_type\_t H5PLget\_plugin\_type(void)* — This function is expected to return the constant value *H5PL\_TYPE\_FILTER* to indicate that this is a filter library.
+2. *const void\* H5PLget\_plugin\_info(void)* — This function returns a pointer to a table of type *H5Z\_class2\_t*.
   This table contains the necessary information needed to utilize the filter both for reading and for writing.
-   In particular, it specifies the filter id implemented by the library and it must match that id specified for the variable in __nc_def_var_filter__ in order to be used.
+   In particular, it specifies the filter id implemented by the library and it must match that id specified for the variable in *nc\_def\_var\_filter* in order to be used.

If plugin verification fails, then that plugin is ignored and the search continues for another, matching plugin.

@@ -346,9 +369,9 @@ For Zarr, filters are represented using the JSON notation.

Each filter is defined by a JSON dictionary, and each such filter dictionary is guaranteed to have a key named "id" whose value is a unique string defining the filter algorithm: "lz4" or "bzip2", for example.

-The parameters of the filter are defined by additional -- algorithm specific -- keys in the filter dictionary.
+The parameters of the filter are defined by additional — algorithm specific — keys in the filter dictionary.
One commonly used filter is "blosc", which has a JSON dictionary of this form.
-````
+````
{
"id": "blosc",
"cname": "lz4",
@@ -358,9 +381,9 @@ One commonly used filter is "blosc", which has a JSON dictionary of this form.
````

So it has three parameters:
-1. "cname" -- the sub-algorithm used by the blosc compressor, LZ4 in this case.
-2. "clevel" -- the compression level, 5 in this case.
-3. "shuffle" -- is the input shuffled before compression, yes (1) in this case.
+1. "cname" — the sub-algorithm used by the blosc compressor, LZ4 in this case.
+2. "clevel" — the compression level, 5 in this case.
+3. 
"shuffle" — is the input shuffled before compression, yes (1) in this case. NCZarr has four constraints that must be met. @@ -371,7 +394,7 @@ This means that some mechanism is needed to translate between the HDF5 id+parame 3. It must be possible to modify the set of visible parameters in response to environment information such as the type of the associated variable; this is required to mimic the corresponding HDF5 capability. 4. It must be possible to use filters even if HDF5 support is disabled. -Note that the term "visible parameters" is used here to refer to the parameters provided by "nc_def_var_filter" or those stored in the dataset's metadata as provided by the JSON codec. The term "working parameters" refers to the parameters given to the compressor itself and derived from the visible parameters. +Note that the term "visible parameters" is used here to refer to the parameters provided by "nc\_def\_var\_filter" or those stored in the dataset's metadata as provided by the JSON codec. The term "working parameters" refers to the parameters given to the compressor itself and derived from the visible parameters. The standard authority for defining Zarr filters is the list supported by the NumCodecs project [7]. Comparing the set of standard filters (aka codecs) defined by NumCodecs to the set of standard filters defined by HDF5 [3], it can be seen that the two sets overlap, but each has filters not defined by the other. @@ -382,25 +405,27 @@ Rather, it is preferable for there be some extensible way to associate the JSON The mechanism provided to address these issues is similar to that taken by HDF5. A shared library must exist that has certain well-defined entry points that allow the NCZarr code to determine information about a Codec. 
The shared library exports a well-known function name to access Codec information and relate it to a corresponding HDF5 implementation, +Note that the shared library may optionally be the same library containing the HDF5 +filter processor. ## Processing Overview There are several paths by which the NCZarr filter API is invoked. -1. The nc_def_var_filter function is invoked on a variable or +1. The nc\_def\_var\_filter function is invoked on a variable or (1a) the metadata for a variable is read when opening an existing variable that has associated Codecs. 2. The visible parameters are converted to a set of working parameters. 3. The filter is invoked with the working parameters. 4. The dataset is closed using the final set of visible parameters. -### Step 1: Invoking nc_def_var_filter +### Step 1: Invoking nc\_def\_var\_filter -In this case, the filter plugin is located and the set of visible parameters (from nc_def_var_filter) are provided. +In this case, the filter plugin is located and the set of visible parameters (from nc\_def\_var\_filter) are provided. ### Step 1a: Reading metadata In this case, the codec is read from the metadata and must be converted to a visible set of HDF5 style parameters. -It is possible that this set of visible parameters differs from the set that was provided by nc_def_var_filter. +It is possible that this set of visible parameters differs from the set that was provided by nc\_def\_var\_filter. If this is important, then the filter implementation is responsible for marking this difference using, for example, different number of parameters or some differing value. ### Step 2: Convert visible parameters to working parameters @@ -423,30 +448,30 @@ If no change is detected, then re-writing the compressor metadata may be avoided Currently, there is no way to specify use of a filter via Codec through the netcdf-c API. 
Rather, one must know the HDF5 id and parameters of
-the filter of interest and use the functions ''nc_def_var_filter'' and ''nc_inq_var_filter''.
+the filter of interest and use the functions *nc\_def\_var\_filter* and *nc\_inq\_var\_filter*.
Internally, the NCZarr code will use information about known Codecs to convert the HDF5 filter reference to the corresponding Codec.
-This restriction also holds for the specification of filters in ''ncgen'' and ''nccopy''.
+This restriction also holds for the specification of filters in *ncgen* and *nccopy*.
This limitation may be lifted in the future.

## Special Codecs Attribute

-A new special attribute is defined called ''_Codecs'' in parallel to the current ''_Filters'' special attribute. Its value is a string containing the JSON representation of the Codecs associated with a given variable.
+A new special attribute called *\_Codecs* is defined in parallel to the current *\_Filters* special attribute. Its value is a string containing the JSON representation of the Codecs associated with a given variable.
This can be especially useful when a file is unreadable because it uses a filter not available to the netcdf-c library.
-That is, no implementation was found in the e.g. ''HDF5_PLUGIN_PATH'' directory.
-In this case ''ncdump -hs'' will display the raw Codec information so that it may be possible to see what filter is missing.
+That is, no implementation was found in, e.g., the *HDF5\_PLUGIN\_PATH* directory.
+In this case, *ncdump -hs* will display the raw Codec information so that it may be possible to see what filter is missing.

## Pre-Processing Filter Libraries

The process for using filters for NCZarr is defined to operate in several steps.
First, as with HDF5, all shared libraries in a specified directory
-(e.g. ''HDF5_PLUGIN_PATH'') are scanned.
+(e.g. *HDF5\_PLUGIN\_PATH*) are scanned.
They are interrogated to see what kind of library they implement, if any.
This interrogation operates by seeing if certain well-known (function) names are defined in this library. There will be two library types: -1. HDF5 -- exports a specific API: "H5Z\_plugin\_type" and "H5Z\_get\_plugin\_info". -2. Codec -- exports a specific API: "NCZ\_get\_codec\_info" +1. HDF5 — exports a specific API: "H5Z\_plugin\_type" and "H5Z\_get\_plugin\_info". +2. Codec — exports a specific API: "NCZ\_get\_codec\_info" Note that a given library can export either or both of these APIs. This means that we can have three types of libraries: @@ -455,7 +480,7 @@ This means that we can have three types of libraries: 2. Codec only 3. HDF5 + Codec -Suppose that our ''HDF5_PLUGIN_PATH'' location has an HDF5-only library. +Suppose that our *HDF5\_PLUGIN\_PATH* location has an HDF5-only library. Then by adding a corresponding, separate, Codec-only library to that same location, it is possible to make an HDF5 library usable by NCZarr. It is possible to do this without having to modify the HDF5-only library. Over time, it is possible to merge an HDF5-only library with a Codec-only library to produce a single, combined library. @@ -466,7 +491,7 @@ The netcdf-c library processes all of the shared libraries by interrogating each Any libraries that do not export one or both of the well-known APIs is ignored. Internally, the netcdf-c library pairs up each HDF5 library API with a corresponding Codec API by invoking the relevant well-known functions -(See Appendix E). +(See Appendix E/a>). This results in this table for associated codec and hdf5 libraries.
HDF5 APICodec APIAction @@ -481,7 +506,7 @@ As a special case, a shared library may be created to hold defaults for a common set of filters. Basically, there is a specially defined function that returns a vector of codec APIs. These defaults are used only if -not other library provided codec information for a filter. +no other library provides codec information for a filter. Currently, the defaults library provides codec defaults for Shuffle, Fletcher32, Deflate (zlib), and SZIP. @@ -493,9 +518,9 @@ filters and to process the meta-data in Codec JSON format. ### Writing an NCZarr Container -When writing, the user program will invoke the NetCDF API function *nc_def_var_filter*. +When writing, the user program will invoke the NetCDF API function *nc\_def\_var\_filter*. This function is currently defined to operate using HDF5-style id and parameters (unsigned ints). -The netcdf-c library examines its list of known filters to find one matching the HDF5 id provided by *nc_def_var_filter*. +The netcdf-c library examines its list of known filters to find one matching the HDF5 id provided by *nc\_def\_var\_filter*. The set of parameters provided is stored internally. Then during writing of data, the corresponding HDF5 filter is invoked to encode the data. @@ -536,7 +561,7 @@ is stored in the JSON dictionary form described earlier. The Codec style, using JSON, has the ability to provide very complex parameters that may be hard to encode as a vector of unsigned integers. It might be desirable to consider exporting a JSON-base API out of the netcdf-c API to support user access to this complexity. -This would mean providing some alternate version of "nc_def_var_filter" that takes a string-valued argument instead of a vector of unsigned ints. +This would mean providing some alternate version of "nc\_def\_var\_filter" that takes a string-valued argument instead of a vector of unsigned ints. This extension is unlikely to be implemented until a compelling use-case is encountered. 
One bad side-effect of this is that we then may have two classes of plugins. @@ -544,10 +569,54 @@ One class can be used by both HDF5 and NCZarr, and a second class that is usable ## Using The NetCDF-C Plugins -As part of its testing, the NetCDF build process creates a number of shared libraries in the ''netcdf-c/plugins'' (or sometimes ''netcdf-c/plugins/.libs'') directory. -If you need a filter from that set, you may be able to set ''HDF5_PLUGIN_PATH'' +As part of its testing, the NetCDF build process creates a number of shared libraries in the *netcdf-c/plugins* (or sometimes *netcdf-c/plugins/.libs*) directory. +If you need a filter from that set, you may be able to set *HDF5\_PLUGIN\_PATH* to point to that directory or you may be able to copy the shared libraries out of that directory to your own location. +# Lossy One-Way Filters + +As of NetCDF version 4.8.2, the netcdf-c library supports +bit-grooming filters. +```` +Bit-grooming is a lossy compression algorithm that removes the +bloat due to false-precision, those bits and bytes beyond the +meaningful precision of the data. Bit Grooming is statistically +unbiased, applies to all floating point numbers, and is easy to +use. Bit-Grooming reduces data storage requirements by +25-80%. Unlike its best-known competitor Linear Packing, Bit +Grooming imposes no software overhead on users, and guarantees +its precision throughout the whole floating point range [9]. +```` +The generic term "quantize" is used to refer collectively to the various +bitgroom algorithms. The key thing to note about quantization is that +it occurs at the point of writing of data only. Since its output is +legal data, it does not need to be "de-quantized" when the data is read. +Because of this, quantization is not part of the standard filter +mechanism and has a separate API. + +The API for bit-groom is currently as follows. 
+```` +int nc_def_var_quantize(int ncid, int varid, int quantize_mode, int nsd); +int nc_inq_var_quantize(int ncid, int varid, int *quantize_modep, int *nsdp); +```` +The *quantize_mode* argument specifies the particular algorithm. +Currently, three are supported: NC_QUANTIZE_BITGROOM, NC_QUANTIZE_GRANULARBR, +and NC_QUANTIZE_BITROUND. In addition quantization can be disabled using +the value NC_NOQUANTIZE. + +The input to ncgen or the output from ncdump supports special attributes +to indicate if quantization was applied to a given variable. +These attributes have the following form. +```` +_QuantizeBitGroomNumberOfSignificantDigits = +or +_QuantizeGranularBitRoundNumberOfSignificantDigits = +or +_QuantizeBitRoundNumberOfSignificantBits = +```` +The value NSD is the number of significant (decimal) digits to keep. +The value NSB is the number of significant bits to keep. + # Debugging {#filters_debug} Depending on the debugger one uses, debugging plugins can be very difficult. @@ -556,9 +625,9 @@ It may be necessary to use the old printf approach for debugging the filter itse One case worth mentioning is when there is a dataset that is using an unknown filter. For this situation, you need to identify what filter(s) are used in the dataset. This can be accomplished using this command. -```` -ncdump -s -h -```` + + ncdump -s -h + Since ncdump is not being asked to access the data (the -h flag), it can obtain the filter information without failures. Then it can print out the filter id and the parameters as well as the Codecs (via the -s flag). @@ -566,34 +635,35 @@ Then it can print out the filter id and the parameters as well as the Codecs (vi Within the netcdf-c source tree, the directory two directories contain test cases for testing dynamic filter operation. -* __netcdf-c/nc_test4__ provides tests for testing HDF5 filters. -* __netcdf-c/nczarr_test__ provides tests for testing NCZarr filters. +* *netcdf-c/nc\_test4* provides tests for testing HDF5 filters. 
+* *netcdf-c/nczarr\_test* provides tests for testing NCZarr filters.

-These tests are disabled if __--disable-shared__ or if __--disable-filter-tests__ is specified.
+These tests are disabled if *--disable-shared* or if *--disable-filter-tests* is specified
+or if *--disable-plugins* is specified.

## HDF5 Example {#filters_Example}

-A slightly simplified version of one of the HDF5 filter test cases is also available as an example within the netcdf-c source tree directory __netcdf-c/examples/C__.
-The test is called __filter_example.c__ and it is executed as part of the __run_examples4.sh__ shell script.
+A slightly simplified version of one of the HDF5 filter test cases is also available as an example within the netcdf-c source tree directory *netcdf-c/examples/C*.
+The test is called *filter\_example.c* and it is executed as part of the *run\_examples4.sh* shell script.
The test case demonstrates dynamic filter writing and reading.

-The files __example/C/hdf5plugins/Makefile.am__ and __example/C/hdf5plugins/CMakeLists.txt__ demonstrate how to build the hdf5 plugin for bzip2.
+The files *example/C/hdf5plugins/Makefile.am* and *example/C/hdf5plugins/CMakeLists.txt* demonstrate how to build the hdf5 plugin for bzip2.

# Notes

## Order of Invocation for Multiple Filters

-When multiple filters are defined on a variable, the order of application, when writing data to the file, is same as the order in which _nc_def_var_filter_ is called.
+When multiple filters are defined on a variable, the order of application, when writing data to the file, is the same as the order in which *nc\_def\_var\_filter* is called.
When reading a file, the order of application is of necessity the reverse.

There are some special cases.

1. The fletcher32 filter is always applied first, if enabled.
-2. 
If _nc_def_var_filter_ or _nc_def_var_deflate_ or _nc_def_var_szip_ is called multiple times with the same filter id, but possibly with different sets of parameters, then the position of that filter in the sequence of applictions does not change.
+2. If *nc\_def\_var\_filter* or *nc\_def\_var\_deflate* or *nc\_def\_var\_szip* is called multiple times with the same filter id, but possibly with different sets of parameters, then the position of that filter in the sequence of applications does not change.
   However, the last set of parameters specified is used when actually writing the dataset.
3. Deflate and shuffle — these two are inextricably linked in the current API, but have quite different semantics.
-   If you call _nc_def_var_deflate_ multiple times, then the previous rule applies with respect to deflate.
-   However, the shuffle filter, if enabled, is ''always'' applied before applying any other filters, except fletcher32.
+   If you call *nc\_def\_var\_deflate* multiple times, then the previous rule applies with respect to deflate.
+   However, the shuffle filter, if enabled, is *always* applied before applying any other filters, except fletcher32.
4. Once a filter is defined for a variable, it cannot be removed nor can its position in the filter order be changed.

## Memory Allocation Issues

@@ -603,13 +673,12 @@ Starting with HDF5 version 1.10.*, the plugin code MUST be careful when using th

In the event that the code is allocating, reallocating, or free'ing memory
that either came from or will be exported to the calling HDF5
library, then one MUST use the corresponding HDF5
-functions *H5allocate_memory()*, *H5resize_memory()*,
-*H5free_memory()* [5] to avoid memory failures.
+functions *H5allocate\_memory()*, *H5resize\_memory()*,
+*H5free\_memory()* [5] to avoid memory failures.

Additionally, if your filter code leaks memory, then the HDF5 library
generates a failure something like this.
-````
-H5MM.c:232: H5MM_final_sanity_check: Assertion `0 == H5MM_curr_alloc_bytes_s' failed.
-````
+
+    H5MM.c:232: H5MM_final_sanity_check: Assertion `0 == H5MM_curr_alloc_bytes_s' failed.

One can look at the code in plugins/H5Zbzip2.c and H5Zmisc.c as illustrations.

@@ -620,13 +689,22 @@ These are handled internally to (mostly) hide them so that they should not affec

Specifically, this filter may do two things.

1. Add extra parameters to the filter parameters: going from the two parameters provided by the user to four parameters for internal use.
-   It turns out that the two parameters provided when calling nc_def_var_filter correspond to the first two parameters of the four parameters returned by nc_inq_var_filter.
-2. Change the values of some parameters: the value of the __options_mask__ argument is known to add additional flag bits, and the __pixels_per_block__ parameter may be modified.
+   It turns out that the two parameters provided when calling nc\_def\_var\_filter correspond to the first two parameters of the four parameters returned by nc\_inq\_var\_filter.
+2. Change the values of some parameters: the value of the *options\_mask* argument is known to add additional flag bits, and the *pixels\_per\_block* parameter may be modified.

-The reason for these changes is has to do with the fact that the szip API provided by the underlying H5Pset_szip function is actually a subset of the capabilities of the real szip implementation.
+The reason for these changes has to do with the fact that the szip API provided by the underlying H5Pset\_szip function is actually a subset of the capabilities of the real szip implementation.
Presumably this is for historical reasons.

-In any case, if the caller uses the __nc_inq_var_szip__ or the __nc_inq_var_filter__ functions, then the parameter values returned may differ from those originally specified.
+In any case, if the caller uses the *nc\_inq\_var\_szip* or the *nc\_inq\_var\_filter* functions, then the parameter values returned may differ from those originally specified.
+
+It should also be noted that the HDF5 szip filter wrapper that
+is invoked depends on the configuration of the netcdf-c library.
+If the HDF5 installation supports szip, then the NCZarr szip
+will use the HDF5 wrapper. If HDF5 does not support szip, or HDF5
+is not enabled, then the plugins directory will contain a local
+HDF5 szip wrapper to be used by NCZarr. This can be confusing,
+but is generally transparent to the user since the plugins
+HDF5 szip wrapper was taken from the HDF5 code base.

## Supported Systems

@@ -639,9 +717,8 @@ The current matrix of OS X build systems known to work is as follows.

## Generic Plugin Build

If you do not want to use Automake or Cmake, the following has been known to work.
-````
-gcc -g -O0 -shared -o libbzip2.so -L${HDF5LIBDIR} -lhdf5_hl -lhdf5 -L${ZLIBDIR} -lz
-````
+
+    gcc -g -O0 -shared -o libbzip2.so -L${HDF5LIBDIR} -lhdf5_hl -lhdf5 -L${ZLIBDIR} -lz

# References {#filters_References}

@@ -649,9 +726,11 @@
2. https://support.hdfgroup.org/HDF5/doc/TechNotes/TechNote-HDF5-CompressionTroubleshooting.pdf
3. https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins
4. https://support.hdfgroup.org/services/contributions.html#filters
5. https://support.hdfgroup.org/HDF5/doc/RM/RM_H5.html
6. https://confluence.hdfgroup.org/display/HDF5/Filters
7. https://numcodecs.readthedocs.io/en/stable/
+8. https://github.com/ccr/ccr
+9. https://escholarship.org/uc/item/7xd1739k

# Appendix A. HDF5 Parameter Encode/Decode {#filters_appendixa}

@@ -662,7 +741,8 @@ The bzip2 compression filter, for example, expects a single integer value from z
This encodes naturally as a single unsigned integer.
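To make the encoding concrete, the sketch below uses hypothetical helpers (`pack_float_param` and `unpack_float_param` are illustrations, not part of the netcdf-c API) to carry a float-valued parameter bit-for-bit in the unsigned int slots that filter parameter vectors use:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helpers (not part of the netcdf-c API): carry a
 * float-valued filter parameter in an unsigned int slot by copying
 * the bytes, so the bit pattern (not the numeric value) survives. */
static unsigned int pack_float_param(float f) {
    unsigned int u;
    memcpy(&u, &f, sizeof u);   /* bit-preserving, unlike (unsigned)f */
    return u;
}

static float unpack_float_param(unsigned int u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

The same `memcpy` trick applies to signed 32-bit integers; only values wider than 32 bits need the split-and-swap treatment discussed in this appendix.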
Note that signed integers and single-precision (32-bit) float values also can easily be represented as 32 bit unsigned integers by proper casting to an unsigned integer so that the bit pattern is preserved. -Simple integer values of type short or char (or the unsigned versions) can also be mapped to an unsigned integer by truncating to 16 or 8 bits respectively and then zero extending. +Simple signed integer values of type short or char can also be mapped to an unsigned integer by truncating to 16 or 8 bits respectively and then sign extending. Similarly, unsigned 8 and 16 bit +values can be used with zero extensions. Machine byte order (aka endian-ness) is an issue for passing some kinds of parameters. You might define the parameters when compressing on a little endian machine, but later do the decompression on a big endian machine. @@ -679,9 +759,10 @@ But this will be incorrect for 64-bit values. So, we have this situation (for HDF5 only): -1. the 8 bytes come in as native machine order for the machine doing the call to *nc_def_var_filter*. -2. HDF5 divides the 8 bytes into 2 four byte pieces and ensures that each piece is in network (big) endian order. -3. When the filter is called, the two pieces are returned in the same order but with the bytes in each piece consistent with the native machine order for the machine executing the filter. +1. the 8 bytes start as native machine order for the machine doing the call to *nc\_def\_var\_filter*. +2. The caller divides the 8 bytes into 2 four byte pieces and passes them to *nc\_def\_var\_filter*. +3. HDF5 takes each four byte piece and ensures that each piece is in network (big) endian order. +4. When the filter is called, the two pieces are returned in the same order but with the bytes in each piece consistent with the native machine order for the machine executing the filter. 
## Encoding Algorithms for HDF5

@@ -723,38 +804,38 @@ To support these rules, some utility programs exist and are discussed in
Filter Specification Syntax.
-2. ''int ncaux_h5filterspec_parselist(const char* txt, int* formatp, size_t* nspecsp, struct NC_H5_Filterspec*** vectorp);''
+2. *int ncaux\_h5filterspec\_parselist(const char* txt, int* formatp, size\_t* nspecsp, struct NC\_H5\_Filterspec*** vectorp);*
    * txt contains the text of a sequence of '|' separated filter specs.
    * formatp currently always returns 0.
    * nspecsp will return the number of filter specifications.
    * vectorp will return a pointer to a vector of pointers to filter specification instances — the caller must free.
   This function parses a sequence of filter specifications each separated by a '|' character.
-The text between '|' separators must be parsable by __ncaux_h5filterspec_parse__.
+The text between '|' separators must be parsable by *ncaux\_h5filterspec\_parse*.
-3. ''void ncaux_h5filterspec_free(struct NC_H5_Filterspec* f);''
-   * f is a pointer to an instance of ````struct NC_H5_Filterspec````
+3. *void ncaux\_h5filterspec\_free(struct NC\_H5\_Filterspec* f);*
+   * f is a pointer to an instance of *struct NC\_H5\_Filterspec*
   Typically this was returned as an element of the vector returned
-  by __ncaux_h5filterspec_parselist__.
+  by *ncaux\_h5filterspec\_parselist*.
  This reclaims the parameters of the filter spec object as well as the object itself.
-4. ''int ncaux_h5filterspec_fix8(unsigned char* mem8, int decode);''
+4. *int ncaux\_h5filterspec\_fix8(unsigned char* mem8, int decode);*
   * mem8 is a pointer to the 8-byte value to fix in place.
   * decode is 1 if the function should apply the 8-byte decoding algorithm else apply the encoding algorithm.
  This function implements the 8-byte conversion algorithms for HDF5.
+Before calling *nc\_def\_var\_filter* (unless *NC\_parsefilterspec* was used), the client must call this function with the decode argument set to 0.
  Inside the filter code, this function should be called with the decode argument set to 1.

-Examples of the use of these functions can be seen in the test program ''nc_test4/tst_filterparser.c''.
+Examples of the use of these functions can be seen in the test program *nc\_test4/tst\_filterparser.c*.

-Some of the above functions use a C struct defined in _netcdf_filter.h_.
+Some of the above functions use a C struct defined in *netcdf\_filter.h*.
The definition of that struct is as follows.
````
typedef struct NC_H5_Filterspec {
@@ -767,19 +848,18 @@ This struct in effect encapsulates all of the information about and HDF5 formatt

# Appendix C. Build Flags for Detecting the Filter Mechanism {#filters_appendixc}

-The include file _netcdf_meta.h contains the following definition.
+The include file *netcdf\_meta.h* contains the following definition.
````
-#define NC_HAS_MULTIFILTERS 1
+    #define NC_HAS_MULTIFILTERS 1
````
+This, in conjunction with the error code *NC\_ENOFILTER* in *netcdf.h*, can be used to see what filter mechanism is in place as described in the section on incompatibilities.

-This, in conjunction with the error code _NC_ENOFILTER_ in _netcdf.h_ can be used to see what filter mechanism is in place as described in the section on incompatibities.
-
-1. !defined(NC_ENOFILTER) && !defined(NC_HAS_MULTIFILTERS) — indicates that the old pre-4.7.4 mechanism is in place.
+1. !defined(NC\_ENOFILTER) && !defined(NC\_HAS\_MULTIFILTERS) — indicates that the old pre-4.7.4 mechanism is in place.
   It does not support multiple filters.
-2. defined(NC_ENOFILTER) && !defined(NC_HAS_MULTIFILTERS) — indicates that the 4.7.4 mechanism is in place.
- It does support multiple filters, but the error return codes for _nc_inq_var_filter_ are different and the filter spec parser functions are in a different location with different names.
-3. defined(NC_ENOFILTER) && defined(NC_HAS_MULTIFILTERS) — indicates that the multiple filters are supported, and that _nc_inq_var_filter_ returns a filterid of zero to indicate that a variable has no filters.
-   Also, the filter spec parsers have the names and signatures described in this document and are define in _netcdf_aux.h_.
+2. defined(NC\_ENOFILTER) && !defined(NC\_HAS\_MULTIFILTERS) — indicates that the 4.7.4 mechanism is in place.
+   It does support multiple filters, but the error return codes for *nc\_inq\_var\_filter* are different and the filter spec parser functions are in a different location with different names.
+3. defined(NC\_ENOFILTER) && defined(NC\_HAS\_MULTIFILTERS) — indicates that multiple filters are supported, and that *nc\_inq\_var\_filter* returns a filterid of zero to indicate that a variable has no filters.
+   Also, the filter spec parsers have the names and signatures described in this document and are defined in *netcdf\_aux.h*.

# Appendix D. BNF for Specifying Filters in Utilities {#filters_appendixd}

@@ -800,24 +880,25 @@
parameter: unsigned32
where
unsigned32: <32 bit unsigned integer>
````
+
# Appendix E. Codec API {#filters_appendixe}

The Codec API mirrors the HDF5 API closely.
It has one well-known function that can be invoked to obtain information about the Codec as well as pointers to special functions to perform conversions.

## The Codec Plugin API

### NCZ\_get\_codec\_info

This function returns a pointer to a C struct that provides detailed information about the codec plugin.

#### Signature
````
-void* NCZ_get_codec_info(void);
+    void* NCZ_get_codec_info(void);
````
-The value returned is actually of type ''struct NCZ_codec_t'',
-but is of type ''void*'' to allow for extensions.
+The value returned is actually of type *struct NCZ\_codec\_t*,
+but is of type *void\** to allow for extensions.

-### NCZ_codec_t
+### NCZ\_codec\_t
````
typedef struct NCZ_codec_t {
int version; /* Version number of the struct */
@@ -835,143 +916,192 @@ typedef struct NCZ_codec_t {

The semantics of the non-function fields are as follows:

-1. ''version'' -- Version number of the struct.
-2. ''sort'' -- Format of remainder of the struct; currently always NCZ_CODEC_HDF5.
-3. ''codecid'' -- The name/id of the codec.
-4. ''hdf5id'' -- The corresponding hdf5 id.
+1. *version* — Version number of the struct.
+2. *sort* — Format of remainder of the struct; currently always NCZ\_CODEC\_HDF5.
+3. *codecid* — The name/id of the codec.
+4. *hdf5id* — The corresponding hdf5 id.

-### NCZ_codec_to_hdf5
+### NCZ\_codec\_to\_hdf5

Given a JSON Codec representation, it will return a corresponding vector of unsigned integers representing the visible parameters.

#### Signature
-    int (*NCZ_codec_to_hdf)(const char* codec, int* nparamsp, unsigned** paramsp);
-
+````
+    int NCZ_codec_to_hdf5(const char* codec, int* nparamsp, unsigned** paramsp);
+````
#### Arguments

-1. codec -- (in) ptr to JSON string representing the codec.
-2. nparamsp -- (out) store the length of the converted HDF5 unsigned vector
-3. paramsp -- (out) store a pointer to the converted HDF5 unsigned vector; caller must free the returned vector. Note the double indirection.
+1. codec — (in) ptr to JSON string representing the codec.
+2. nparamsp — (out) store the length of the converted HDF5 unsigned vector
+3. paramsp — (out) store a pointer to the converted HDF5 unsigned vector; caller must free the returned vector. Note the double indirection.

Return Value: a netcdf-c error code.

-### NCZ_hdf5_to_codec
+### NCZ\_hdf5\_to\_codec

Given an HDF5 visible parameters vector of unsigned integers and its length, return a corresponding JSON codec representation of those visible parameters.
#### Signature
-    int (*NCZ_hdf5_to_codec)(int ncid, int varid, size_t nparams, const unsigned* params, char** codecp);
-
+````
+    int NCZ_hdf5_to_codec(int ncid, int varid, size_t nparams, const unsigned* params, char** codecp);
+````
#### Arguments

-1. ncid -- the variables' containing group
-2. varid -- the containing variable
-3. nparams -- (in) the length of the HDF5 visible parameters vector
-4. params -- (in) pointer to the HDF5 visible parameters vector.
-5. codecp -- (out) store the string representation of the codec; caller must free.
+1. ncid — the variable's containing group
+2. varid — the containing variable
+3. nparams — (in) the length of the HDF5 visible parameters vector
+4. params — (in) pointer to the HDF5 visible parameters vector.
+5. codecp — (out) store the string representation of the codec; caller must free.

Return Value: a netcdf-c error code.

-### NCZ_modify_parameters
+### NCZ\_modify\_parameters

Extract environment information from the (ncid,varid) and use it to convert a set of visible parameters to a set of working parameters; also provide option to modify visible parameters.

#### Signature
-    int (*NCZ_modify_parameters)(int ncid, int varid, size_t* vnparamsp, unsigned** vparamsp, size_t* wnparamsp, unsigned** wparamsp);
-
+````
+    int NCZ_modify_parameters(int ncid, int varid, size_t* vnparamsp, unsigned** vparamsp, size_t* wnparamsp, unsigned** wparamsp);
+````
#### Arguments

+1. ncid — (in) group id containing the variable.
+2. varid — (in) the id of the variable to which this filter is being attached.
+3. vnparamsp — (in/out) the count of visible parameters
+4. 
vparamsp — (in/out) the set of visible parameters
+5. wnparamsp — (out) the count of working parameters
+6. wparamsp — (out) the set of working parameters

Return Value: a netcdf-c error code.

-### NCZ_codec_initialize
+### NCZ\_codec\_initialize

Some compressors may require library initialization.
This function is called as soon as a shared library is loaded and matched with an HDF5 filter.

#### Signature
-    int (*NCZ_codec_initialize)(void);
-
+````
+    int NCZ_codec_initialize(void);
+````
Return Value: a netcdf-c error code.

-### NCZ_codec_finalize
+### NCZ\_codec\_finalize

Some compressors (like blosc) require invoking a finalize function in order to avoid memory loss.
-This function is called during a call to ''nc_finalize'' to do any finalization.
-If the client code does not invoke ''nc_finalize'' then memory checkers may complain about lost memory.
+This function is called during a call to *nc\_finalize* to do any finalization.
+If the client code does not invoke *nc\_finalize*, then memory checkers may complain about lost memory.

#### Signature
-    int (*NCZ_codec_finalize)(void);
-
+````
+    int NCZ_codec_finalize(void);
+````
Return Value: a netcdf-c error code.

## Multi-Codec API

-As an aid to clients, it is convenient if a single shared library can provide multiple ''NCZ_code_t'' instances at one time.
+As an aid to clients, it is convenient if a single shared library can provide multiple *NCZ\_codec\_t* instances at one time.
This API is not intended to be used by plugin developers.
A shared library need only export this function.

-### NCZ_codec_info_defaults
+### NCZ\_codec\_info\_defaults

-Return a NULL terminated vector of pointers to instances of ''NCZ_codec_t''.
+Return a NULL terminated vector of pointers to instances of *NCZ\_codec\_t*.

#### Signature
+````
    void* NCZ_codec_info_defaults(void);
-
-The value returned is actually of type ''NCZ_codec_t**'',
-but is of type ''void*'' to allow for extensions.
+```` +The value returned is actually of type *NCZ\_codec\_t***, +but is of type *void** to allow for extensions. The list of returned items are used to try to provide defaults for any HDF5 filters that have no corresponding Codec. This is for internal use only. -# Appendix F. Pre-built Filters - -As part of the overall build process, a number of filters are built as shared libraries in the "plugins" directory -— in that directory or the "plugins/.libs" subdirectory. +# Appendix F. Standard Filters -An option exists to allow some of those filters to be installed into a user-specified directory. The relevant options are as follows: +Support for a select set of standard filters is built into the NetCDF API. +Generally, they are accessed using the following generic API, where XXXX is +the filter name. As a rule, the names are those used in the HDF5 filter ID naming authority [4] or the NumCodecs naming authority [7]. ```` -./configure: --with-plugin-dir= -cmake: -DPLUGIN_INSTALL_DIR= +int nc_def_var_XXXX(int ncid, int varid, unsigned filterid, size_t nparams, unsigned* params); +int nc_inq_var_XXXX(int ncid, int varid, int* hasfilter, size_t* nparamsp, unsigned* params); ```` -If the value of the environment variable "HDF5_PLUGIN_PATH" is a single directory, then -a good value for the install directory is "$HDF5_PLUGIN_PATH", so for example: -```` -./configure ... --with-plugin-dir="$HDF5_PLUGIN_DIR" -```` - -If this option is specified, then as part of the "install" build action, -a specified set of filter shared libraries will be copied into the specified directory. -Any existing library of the same name will be overwritten. If the specified directory -itself does not exist, then it will be created. 
-
-Currently, if the following filters are available, they will be installed;
-* ''libh5bzip2.so'' -- an HDF5 filter wrapper for bzip2 compression
-* ''libh5blosc.so'' -- an HDF5 filter wrapper for blosc compression
-* ''libh5zstd.so'' -- an HDF5 filter wrapper for zstandard compression
-
-If the user is using NCZarr filters, then the following additional filters will be installed.
-* libh5shuffle.so -- shuffle filter
-* libh5fletcher32.so -- fletcher32 checksum
-* libh5deflate.so -- deflate compression
-* libh5szip.so -- szip compression
-* libnczdefaults.so -- provide NCZarr support for shuffle, fletcher32, and deflate.
-* libnczszip.so -- provide NCZarr support for szip.
+The first function inserts the specified filter into the filter chain for a given variable.
+The second function queries the given variable to see if the specified filter
+is in the filter chain for that variable. The *hasfilter* argument is set
+to one if the filter is in the chain and zero otherwise.
+As is usual with the netcdf API, one is expected to call this function twice.
+The first time to set *nparamsp* and the second to get the parameters in the client-allocated memory argument *params*.
+Any of these arguments can be NULL, in which case no value is returned.
+
+Note that NetCDF inherits four filters from HDF5, namely shuffle, fletcher32, deflate (zlib), and szip. The APIs for these do not conform to the above API.
+So aside from those four, the current set of standard filters is as follows.
+
+<table>
+<tr><th>Filter Name<th>Filter ID<th>Reference
+<tr><td>zstandard<td>32015<td>https://facebook.github.io/zstd/
+<tr><td>bzip2<td>307<td>https://sourceware.org/bzip2/
+</table>
-The shuffle, fletcher32, deflate, and szip filters in this case will be ignored by HDF5 and only used by the NCZarr code. -Note that if you disable HDF5 support, but leave NCZarr support enabled, then all of the above filters -should continue to work. +# Appendix G. Finding Filters + +A major problem for filter users is finding an implementation for a filter. +There are several ways to do this. + +* **HDF5 Assigned Filter Identifiers Repository [3]** — +HDF5 maintains a page of standard filter identifiers along with +additional contact information. This often includes a pointer +to source code. + +* **Community Codec Repository** — +The Community Codec Repository (CCR) project [8] provides +filters, including HDF5 wrappers, for a number of filters. +You can install this library to get access to these supported filters. +It does not currently include the required NCZarr Codec API, +so they are only usable with netcdf-4. This will change in the future. + +* **NetCDF-C Test Plugins Directory** — +As part of the overall build process, a number of filters are built as shared libraries in the "plugins" directory. +They may be in that directory or the "plugins/.libs" subdirectory. +It may be possible for users to utilize some of those libraries to provide filter support for general use. + + + If the user is using NCZarr filters, then the plugins directory has at least the following shared libraries + * libh5shuffle.so — shuffle filter + * libh5fletcher32.so — fletcher32 checksum + * libh5deflate.so — deflate compression + * libnczdefaults.so — provide NCZarr support for shuffle, fletcher32, and deflate. + * *libh5bzip2.so* — an HDF5 filter for bzip2 compression + * *libh5blosc.so* — an HDF5 filter for blosc compression + * *libh5zstd.so* — an HDF5 filter for zstandard compression + + The shuffle, fletcher32, and deflate filters in this case will + be ignored by HDF5 and only used by the NCZarr code. 
But in
+  order to use them, they need additional Codec capabilities
+  provided by the libnczdefaults.so shared library. Note also that
+  if you disable HDF5 support, but leave NCZarr support enabled,
+  then all of the above filters should continue to work.
+
+## HDF5_PLUGIN_PATH
+
+At the moment, NetCDF uses the existing HDF5 environment variable
+*HDF5\_PLUGIN\_PATH* to locate the directories in which filter wrapper
+shared libraries are located. This is used both for the HDF5 filter
+wrappers and for the NCZarr codec wrappers.
+
+*HDF5\_PLUGIN\_PATH* is a typical Windows or Unix style
+path-list. That is, it is a sequence of absolute directory paths
+separated by a specific separator character. For Windows, the
+separator character is a semicolon (';') and for Unix, it is a
+colon (':'). For convenience, NCZarr will also accept the
+semicolon separator for Unix.
+
+So, the user can add the CCR and/or the plugins directory to
+the *HDF5\_PLUGIN\_PATH* environment variable to allow the netcdf-c
+library to locate wrappers.

# Point of Contact {#filters_poc}

-__Author__: Dennis Heimbigner<br>
-__Email__: dmh at ucar dot edu<br>
-__Initial Version__: 1/10/2018<br>
-__Last Revised__: 7/17/2021
-
+*Author*: Dennis Heimbigner<br>
+*Email*: dmh at ucar dot edu<br>
+*Initial Version*: 1/10/2018<br>
+*Last Revised*: 3/14/2022 diff --git a/docs/internal.md b/docs/internal.md new file mode 100644 index 0000000000..05c35257c6 --- /dev/null +++ b/docs/internal.md @@ -0,0 +1,639 @@ +Notes On the Internals of the NetCDF-C Library +============================ + + +# Notes On the Internals of the NetCDF-C Library {#intern_head} + +\tableofcontents + +This document attempts to record important information about +the internal architecture and operation of the netcdf-c library. + +# 1. Including C++ Code in the netcdf-c Library {#intern_c++} + +The state of C compiler technology has reached the point where +it is possible to include C++ code into the netcdf-c library +code base. Two examples are: + +1. The AWS S3 SDK wrapper *libdispatch/ncs3sdk.cpp* file. +2. The TinyXML wrapper *ncxml\_tinyxml2.cpp* file. + +However there are some consequences that must be handled for this to work. +Specifically, the compiler must be told that the C++ runtime is needed +in the following ways. + +## Modifications to *lib\_flags.am* +Suppose we have a flag *ENABLE\_XXX* where that XXX +feature entails using C++ code. Then the following must be added +to *lib\_flags.am* +```` +if ENABLE_XXX +AM_LDFLAGS += -lstdc++ +endif +```` + +## Modifications to *libxxx/Makefile.am* + +The Makefile in which the C++ code is included and compiled +(assumed here to be the *libxxx* directory) must have this set. +```` +AM_CXXFLAGS = -std=c++11 +```` +It is possible that other values (e.g. *-std=c++14*) may also work. + +# 2. Managing instances of complex data types + +For a long time, there have been known problems with the +management of complex types containing VLENs. This also +involves the string type because it is stored as a VLEN of +chars. + +The term complex type refers to any type that directly or +recursively references a VLEN type. So an array of VLENS, a +compound with a VLEN field, and so on. 
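For concreteness, the standalone sketch below (an illustration, not netcdf-c code; the `vlen_t` struct mirrors the `nc_vlen_t` layout in netcdf.h) shows why such instances need a deep rather than shallow reclaim: every instance owns a separately allocated interior block.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative layout mirroring nc_vlen_t from netcdf.h. */
typedef struct vlen_t {
    size_t len;   /* number of base-type elements in this instance */
    void *p;      /* separately allocated interior block */
} vlen_t;

/* Deep reclaim of a vector of VLEN instances: free every interior
 * block, then optionally the top-level vector itself. A shallow free
 * would release only `vec` and leak every `p`. Returns the number of
 * interior blocks freed (for illustration only). */
static size_t deep_reclaim(vlen_t *vec, size_t count, int free_top) {
    size_t freed = 0;
    for (size_t i = 0; i < count; i++) {
        if (vec[i].p != NULL) {
            free(vec[i].p);
            vec[i].p = NULL;
            freed++;
        }
    }
    if (free_top)
        free(vec);
    return freed;
}
```

A compound type with a VLEN field needs the same walk applied to each field, and nested VLENs require recursion; that is what the library's deep-walking functions provide.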
+
+In order to properly handle instances of these complex types, it
+is necessary to have functions that can recursively walk
+instances of such types to perform various actions on them. The
+term "deep" is also used to mean recursive.
+
+Two deep walking operations are provided by the netcdf-c library
+to aid in managing instances of complex structures.
+* free'ing an instance of the complex type
+* copying an instance of the complex type.
+
+Previously, the netcdf-c library only did shallow free and shallow copy of
+complex types. This meant that only the top level was properly
+free'd or copied, but deep internal blocks in the instance were
+not touched. This led to a host of memory leaks and failures
+when the deep data was effectively shared between the netcdf-c library
+internally and the user's data.
+
+Note that the term "vector" is used to mean a contiguous (in
+memory) sequence of instances of some type. Given an array with,
+say, dimensions 2 X 3 X 4, this will be stored in memory as a
+vector of length 2*3*4=24 instances.
+
+The use cases are primarily these.
+
+## nc\_get\_vars
+Suppose one is reading a vector of instances using nc\_get\_vars
+(or nc\_get\_vara or nc\_get\_var, etc.). These functions will
+return the vector in the top-level memory provided. All
+interior blocks (from nested VLENs or strings) will have been
+dynamically allocated. Note that computing the size of the vector
+may be tricky because the strides must be taken into account.
+
+After using this vector of instances, it is necessary to free
+(aka reclaim) the dynamically allocated memory, otherwise a
+memory leak occurs. So, the recursive reclaim function is used
+to walk the returned instance vector and do a deep reclaim of
+the data.
+
+Currently functions are defined in netcdf.h that are supposed to
+handle this: nc\_free\_vlen(), nc\_free\_vlens(), and
+nc\_free\_string(). 
Unfortunately, these functions only do a +shallow free, so deeply nested instances are not properly +handled by them. They are marked in the description as +deprecated in favor of the newer recursive function. + +## nc\_put\_vars + +Suppose one is writing a vector of instances using nc\_put\_vars +(or nc\_put\_vara or nc\_put\_var, etc.). These functions will +write the contents of the vector to the specified variable. +Note that internally, the data passed to the nc\_put\_xxx function is +immediately written so there is no need to copy it internally. But the +caller may need to reclaim the vector of data that was created and passed +in to the nc\_put\_xxx function. + +After writing this vector of instances, and assuming it was dynamically +created, at some point it will be necessary to reclaim that data. +So again, the recursive reclaim function can be used +to walk the returned instance vector and do a deep reclaim of +the data. + +## nc\_put\_att +Suppose one is writing a vector of instances as the data of an attribute +using, say, nc\_put\_att. + +Internally, the incoming attribute data must be copied and stored +so that changes/reclamation of the input data will not affect +the attribute. Note that this copying behavior is different from +writing to a variable, where the data is written immediately. + +Again, the code inside the netcdf library used to use only shallow copying +rather than deep copy. As a result, one saw effects such as described +in Github Issue https://github.com/Unidata/netcdf-c/issues/2143. + +Also, after defining the attribute, it may be necessary for the user +to free the data that was provided as input to nc\_put\_att() as in the +nc\_put\_xxx functions (previously described). + +## nc\_get\_att +Suppose one is reading a vector of instances as the data of an attribute +using, say, nc\_get\_att. 
+ +Internally, the existing attribute data must be copied and returned +to the caller, and the caller is responsible for reclaiming +the returned data. + +Again, the code inside the netcdf library used to only do shallow copying +rather than deep copy. So this could lead to memory leaks and errors +because the deep data was shared between the library and the user. + +## New Instance Walking API + +Proper recursive functions were added to the netcdf-c library to +provide reclaim and copy functions and use those as needed. +These functions are defined in libdispatch/dinstance.c and their +signatures are defined in include/netcdf.h. For back +compatibility, corresponding "ncaux\_XXX" functions are defined +in include/netcdf\_aux.h. +```` +int nc_reclaim_data(int ncid, nc_type xtypeid, void* memory, size_t count); +int nc_reclaim_data_all(int ncid, nc_type xtypeid, void* memory, size_t count); +int nc_copy_data(int ncid, nc_type xtypeid, const void* memory, size_t count, void* copy); +int nc_copy_data_all(int ncid, nc_type xtypeid, const void* memory, size_t count, void** copyp); +```` +There are two variants. The first two, nc\_reclaim\_data() and +nc\_copy\_data(), assume the top-level vector is managed by the +caller. For reclaim, this is so the user can use, for example, a +statically allocated vector. For copy, it assumes the user +provides the space into which the copy is stored. + +The second two, nc\_reclaim\_data\_all() and +nc\_copy\_data\_all(), allows the functions to manage the +top-level. So for nc\_reclaim\_data\_all, the top level is +assumed to be dynamically allocated and will be free'd by +nc\_reclaim\_data\_all(). The nc\_copy\_data\_all() function +will allocate the top level and return a pointer to it to the +user. The user can later pass that pointer to +nc\_reclaim\_data\_all() to reclaim the instance(s). + +# Internal Changes +The netcdf-c library internals are changed to use the proper reclaim +and copy functions. 
This also allows some simplification of the code +since the stdata and vldata fields of NC\_ATT\_INFO are no longer needed. +Currently this is commented out using the SEPDATA \#define macro. +When the bugs are found and fixed, all this code will be removed. + +## Optimizations + +In order to make these functions as efficient as possible, it is +desirable to classify all types as to whether or not they contain +variable-size data. If a type is fixed sized (i.e. does not contain +variable-size data) then it can be freed or copied as a single chunk. +This significantly increases the performance for such types. +For variable-size types, it is necessary to walk each instance of the type +and recursively reclaim or copy it. As another optimization, +if the type is a vector of strings, then the per-instance walk can be +sped up by doing the reclaim or copy inline. + +The rules for classifying types as fixed or variable size are as follows. + +1. All atomic types, except string, are fixed size. +2. All enum type and opaque types are fixed size. +3. All string types and VLEN types are variable size. +4. A compound type is fixed size if all of the types of its + fields are fixed size. Otherwise it has variable size. + +The classification of types can be made at the time the type is defined +or is read in from some existing file. The reclaim and copy functions +use this information to speed up the handling of fixed size types. + +# Warnings + +1. The new API functions require that the type information be + accessible. This means that you cannot use these functions + after the file has been closed. After the file is closed, you + are on your own. + +2. There is still one known failure that has not been solved; it is + possibly an HDF5 memory leak. All the failures revolve around + some variant of this .cdl file. The proximate cause of failure is + the use of a VLEN FillValue. 
+````
+    netcdf x {
+    types:
+      float(*) row_of_floats ;
+    dimensions:
+      m = 5 ;
+    variables:
+      row_of_floats ragged_array(m) ;
+      row_of_floats ragged_array:_FillValue = {-999} ;
+    data:
+      ragged_array = {10, 11, 12, 13, 14}, {20, 21, 22, 23}, {30, 31, 32},
+      {40, 41}, _ ;
+    }
+````

+# 3. Inferring File Types
+
+As described in the companion document -- docs/dispatch.md --
+when nc\_create() or nc\_open() is called, it must figure out what
+kind of file is being created or opened. Once it has figured out
+the file kind, the appropriate "dispatch table" can be used
+to process that file.
+
+## The Role of URLs
+
+Figuring out the kind of file is referred to as model inference
+and is, unfortunately, a complicated process. The complication
+is mostly a result of allowing a path argument to be a URL.
+Inferring the file kind from a URL requires deep processing of
+the URL structure: the protocol, the host, the path, and the fragment
+parts in particular. The query part is currently not used because
+it usually contains information to be processed by the server
+receiving the URL.
+
+The "fragment" part of the URL may be unfamiliar.
+The last part of a URL may optionally contain a fragment, which
+is syntactically of this form in this pseudo URL specification.
+````
+<protocol>://<host>/<path>?<query>#<fragment>
+````
+The form of the fragment is similar to a query and takes this general form.
+````
+'#'<key>=<value>&<key>=<value>&...
+````
+The key is a simple name, the value is any sequence of characters,
+although URL special characters such as '&' must be URL encoded in
+the '%XX' form where each X is a hexadecimal digit.
+A nonsensical example might look like this:
+````
+https://host.com/path#mode=nczarr,s3&bytes
+````
+It is important to note that the fragment part is not intended to be
+passed to the server, but rather is processed by the client program.
+It is this property that allows the netcdf-c library to use it to
+pass information deep into the dispatch table code that is processing the
+URL.
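To make the client-side fragment handling concrete, here is a minimal standalone sketch (a hypothetical helper, not the library's actual parser, which lives in libdispatch/dinfermodel.c) that extracts the value of one key from a fragment of the key=value form described above. Keys without values (such as "bytes") are not matched by this sketch, and `outlen` is assumed to be at least 1.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: look up `key` in a fragment string of the
 * form "key=value&key=value&...". Copies the value into `out`
 * (truncating to outlen-1 characters) and returns 1 if found,
 * else returns 0. */
static int fragment_get(const char *fragment, const char *key,
                        char *out, size_t outlen) {
    size_t klen = strlen(key);
    const char *p = fragment;
    while (*p != '\0') {
        const char *end = strchr(p, '&');                 /* segment end */
        size_t seglen = end ? (size_t)(end - p) : strlen(p);
        if (seglen > klen && p[klen] == '=' && strncmp(p, key, klen) == 0) {
            size_t vlen = seglen - klen - 1;              /* value length */
            if (vlen >= outlen) vlen = outlen - 1;        /* truncate */
            memcpy(out, p + klen + 1, vlen);
            out[vlen] = '\0';
            return 1;
        }
        if (!end) break;
        p = end + 1;
    }
    return 0;
}
```

For the example fragment "mode=nczarr,s3&bytes", looking up "mode" yields the value "nczarr,s3", which is the kind of information the inference code extracts from the fragment.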
+
+## Model Inference Inputs
+
+The inference algorithm is given the following information
+from which it must determine the kind of file being accessed.
+
+### Mode
+
+The mode is a set of flags that are passed as the second
+argument to nc\_create and nc\_open. The set of flags is defined in
+the netcdf.h header file. Generally it specifies the overall
+format of the file: netcdf-3 (classic) or netcdf-4 (enhanced).
+Variants of these can also be specified, e.g. 64-bit netcdf-3 or
+classic netcdf-4.
+In the case where the path argument is a simple file path,
+using a mode flag is the most common mechanism for specifying
+the model.
+
+### Path
+The file path, the first argument to nc\_create and nc\_open,
+can be either a simple file path or a URL.
+If it is a URL, then it will be deeply inspected to determine
+the model.
+
+### File Contents
+When the contents of a real file are available,
+the contents of the file can be used to determine the dispatch table.
+As a rule, this is likely to be useful only for *nc\_open*.
+It also requires access to functions that can open and read at least
+the initial part of the file.
+As a rule, the initial small prefix of the file is read
+and examined to see if it matches any of the so-called
+"magic numbers" that indicate the kind of file being read.
+
+### Open vs Create
+Is the file being opened or is it being created?
+
+### Parallelism
+Is parallel IO available?
+
+## Model Inference Outputs
+The inference algorithm outputs two pieces of information.
+
+1. model -- this is used by nc\_open and nc\_create to choose the dispatch table.
+2. newpath -- in some cases, usually URLs, the path may be rewritten to include extra information for use by the dispatch functions.
+
+The model output is actually a struct containing two fields:
+
+1. implementation -- this is a value from the NC\_FORMATX\_xxx
+   values in netcdf.h. It generally determines the dispatch
+   table to use.
+2.
format -- this is an NC\_FORMAT\_xxx value defining, in effect,
+   the netcdf-format to which the underlying format is to be
+   translated. Thus it can tell the netcdf-3 dispatcher that it
+   should actually implement CDF5 rather than standard netcdf classic.
+
+## The Inference Algorithm
+
+The construction of the model is primarily carried out by the function
+*NC\_infermodel()* (in *libdispatch/dinfermodel.c*).
+It is given the following parameters:
+1. path -- (IN) absolute file path or URL
+2. modep -- (IN/OUT) the set of mode flags given to *NC\_open* or *NC\_create*.
+3. iscreate -- (IN) distinguish open from create.
+4. useparallel -- (IN) indicate if parallel IO can be used.
+5. params -- (IN/OUT) arbitrary data dependent on the mode and path.
+6. model -- (IN/OUT) place to store the inferred model.
+7. newpathp -- (OUT) the canonical rewrite of the path argument.
+
+As a rule, these values are used in this order of preference
+to infer the model.
+
+1. file contents -- highest precedence
+2. url (if it is one) -- using the "mode=" key in the fragment (see below).
+3. mode flags
+4. default format -- lowest precedence
+
+The sequence of steps is as follows.
+
+### URL Processing -- processuri()
+
+If the path appears to be a URL, then it is parsed
+and processed by the processuri function as follows.
+
+1. Protocol --
+The protocol is extracted and tested against the list of
+legal protocols. If not found, then it is an error.
+If found, then it is replaced by a substitute -- if specified.
+So, for example, the protocol "dods" is replaced by the protocol "http"
+(note that at some point "http" will be replaced with "https").
+Additionally, one or more "key=value" strings are appended
+to the existing fragment of the url. So, again for "dods",
+the fragment is extended by the string "mode=dap2".
+Thus replacing "dods" does not lose information, but rather transfers
+it to the fragment for later use.
+
+2.
Fragment --
+After the protocol is processed, the initial fragment processing occurs
+by converting it to a list data structure of the form
+````
+{<key,value>,<key,value>,<key,value>,...}
+````
+
+### Macro Processing -- processmacros()
+
+If the fragment list produced by processuri() is non-empty, then
+it is processed for "macros". Notice that if the original path
+was not a URL, then the fragment list is empty and this
+processing will be bypassed. In any case, it is convenient to
+allow some singleton fragment keys to be expanded into larger
+fragment components. In effect, those singletons act as
+macros. They can help to simplify the user's URL. The term
+singleton means a fragment key with no associated value:
+"#bytes", for example.
+
+The list of fragments is searched looking for keys whose
+value part is NULL or the empty string. The table
+of macros is then searched for that key and, if found, the corresponding
+key and value are appended to the fragment list and the singleton
+is removed.
+
+### Mode Inference -- processinferences()
+
+This function just processes the list of values associated
+with the "mode" key. It is similar to a macro in that
+certain mode values are added or removed based on tables
+of "inferences" and "negations".
+Again, the purpose is to allow users to provide simplified URL fragments.
+
+The list of mode values is repeatedly searched and whenever a value
+is found that is in the "modeinferences" table, the associated inference value
+is appended to the list of mode values. This process stops when no changes
+occur. This form of inference allows the user to specify "mode=zarr"
+and have it converted to "mode=nczarr,zarr". This avoids the need for the
+dispatch table code to do the same inference.
+
+After the inferences are made, the list of mode values is again
+repeatedly searched and whenever a value
+is found that is in the "modenegations" table, the associated negation value
+is removed from the list of mode values, assuming it is there.
This process stops when no changes
+occur. This form of inference allows the user to make sure that "mode=bytes,nczarr"
+has the bytes mode take precedence by removing the "nczarr" value. Such illegal
+combinations can occur because of previous processing steps.
+
+### Fragment List Normalization
+As the fragment list is processed, duplicates with the same key can appear.
+A function -- cleanfragments() -- is applied to clean up the fragment list
+by coalescing the values of duplicate keys and removing duplicate key values.
+
+### S3 Rebuild
+If the URL is determined to be a reference to a resource on the Amazon S3 cloud,
+then the URL needs to be converted to what is called "path format".
+There are four S3 URL formats:
+
+1. Virtual -- ````https://<bucket>.s3.<region>.amazonaws.com/<path>````
+2. Path -- ````https://s3.<region>.amazonaws.com/<bucket>/<path>````
+3. S3 -- ````s3://<bucket>/<path>````
+4. Other -- ````https://<host>/<bucket>/<path>````
+
+The S3 processing converts all of these to the Path format. In the "S3" format case,
+it is necessary to find, or default, the region by examining the ".aws" directory files.
+
+### File Rebuild
+If the URL protocol is "file" and its path is a relative file path,
+then it is made absolute by prepending the path of the current working directory.
+
+In any case, after S3 or File rebuilds, the URL is completely
+rebuilt using any modified protocol, host, path, and
+fragments. The query is left unchanged in the current algorithm.
+The resulting rebuilt URL is passed back to the caller.
+
+### Mode Key Processing
+The values of the fragment's "mode" key are processed one by one
+to see if it is possible to determine the model.
+There is a table of format interpretations that maps a mode value
+to the model's implementation and format. So for example,
+if the mode value "dap2" is encountered, then the model
+implementation is set to NC\_FORMATX\_DAP2 and the format
+is set to NC\_FORMAT\_CLASSIC.
+
+### Non-Mode Key Processing
+If processing the mode does not tell us the implementation, then
+all other fragment keys are processed to see if the implementation
+(and format) can be deduced. Currently this does nothing.
+
+### URL Defaults
+If the model is still not determined and the path is a URL, then
+the implementation is defaulted to DAP2. This is for backward
+compatibility, from when all URLs implied DAP2.
+
+### Mode Flags
+In the event that the path is not a URL, then it is necessary
+to use the mode flags and the useparallel argument to choose a model.
+This is just a straightforward flag checking exercise.
+
+### Content Inference -- check\_file\_type()
+If the path is being opened (as opposed to created), then
+it may be possible to actually read the first few bytes of the
+resource specified by the path and use that to determine the
+model. If this succeeds, then it takes precedence over
+all other model inferences.
+
+### Flag Consistency
+Once the model is known, the set of mode flags
+is modified to be consistent with that information.
+So for example, if DAP2 is the model, then all netcdf-4 mode flags
+and some netcdf-3 flags are removed from the set of mode flags,
+because DAP2 provides only a standard netcdf-classic format.
+
+# 4. Adding a Standard Filter
+
+The standard filter system extends the netcdf-c library API to
+support a fixed set of "standard" filters. This is similar to the
+way that deflate and szip are currently supported.
+For background, the file filter.md should be consulted.
+
+In general, the API for a standard filter has the following prototypes.
+The case of zstandard (libzstd) is used as an example.
+````
+int nc_def_var_zstandard(int ncid, int varid, int level);
+int nc_inq_var_zstandard(int ncid, int varid, int* has_filterp, int* levelp);
+````
+So generally the API has the ncid and the varid as fixed, and then
+a list of parameters specific to the filter -- level in this case.
+
+For the inquire function, there is an additional argument -- has_filterp --
+that is set to 1 if the filter is defined for the given variable
+and is 0 if not.
+The remainder of the inquiry parameters are pointers to memory
+into which the parameters are stored -- levelp in this case.
+
+It is important to note that including a standard filter still
+requires three supporting objects:
+
+1. The implementing library for the filter. For example,
+   libzstd must be installed in order to use the zstandard
+   API.
+2. An HDF5 wrapper for the filter must be installed in the
+   directory pointed to by the HDF5_PLUGIN_PATH environment
+   variable.
+3. (Optional) An NCZarr Codec implementation must be installed
+   in the HDF5_PLUGIN_PATH directory.
+
+## Adding a New Standard Filter
+
+The implementation of a standard filter must be loaded from one
+of several locations.
+
+1. It can be part of libnetcdf.so (preferred),
+2. it can be loaded as part of the client code,
+3. or it can be loaded as part of an external library such as libccr.
+
+However, the three objects listed above need to be
+stored in the HDF5_PLUGIN_PATH directory, so adding a standard
+filter still requires modification to the netcdf build system.
+This limitation may be lifted in the future.
+
+### Build Changes
+In order to detect a standard library, the following changes
+must be made for Automake (configure.ac/Makefile.am)
+and CMake (CMakeLists.txt).
+
+#### Configure.ac
+Configure.ac must have a block similar to this that locates
+the implementing library.
+
+````
+# See if we have libzstd
+AC_CHECK_LIB([zstd],[ZSTD_compress],[have_zstd=yes],[have_zstd=no])
+if test "x$have_zstd" = "xyes" ; then
+  AC_SEARCH_LIBS([ZSTD_compress],[zstd zstd.dll cygzstd.dll], [], [])
+  AC_DEFINE([HAVE_ZSTD], [1], [if true, zstd library is available])
+fi
+AC_MSG_CHECKING([whether libzstd library is available])
+AC_MSG_RESULT([${have_zstd}])
+````
+Note that the entry point (*ZSTD_compress*) is library dependent
+and is used to see if the library is available.
+
+#### Makefile.am
+
+It is assumed you have an HDF5 wrapper for zstd. If you want it
+to be built as part of the netcdf-c library then you need to
+add the following to *netcdf-c/plugins/Makefile.am*.
+````
+if HAVE_ZSTD
+noinst_LTLIBRARIES += libh5zstd.la
+libh5zstd_la_SOURCES = H5Zzstd.c H5Zzstd.h
+endif
+
+# Need our version of szip if libsz available and we are not using HDF5
+if HAVE_SZ
+noinst_LTLIBRARIES += libh5szip.la
+libh5szip_la_SOURCES = H5Zszip.c H5Zszip.h
+endif
+````
+
+#### CMakeLists.txt
+Analogous to *configure.ac*, a block like
+this needs to be in *netcdf-c/CMakeLists.txt*.
+````
+FIND_PACKAGE(Zstd)
+set_std_filter(Zstd)
+````
+The FIND_PACKAGE requires a CMake module for the filter
+in the cmake/modules directory.
+The *set_std_filter* function is a macro.
+
+An entry in the file config.h.cmake.in will also be needed.
+````
+/* Define to 1 if zstd library available. */
+#cmakedefine HAVE_ZSTD 1
+````
+
+### Implementation Template
+As a template, here is the implementation for zstandard.
+It can be used as the template for adding other standard filters.
+It is currently located in *netcdf-c/libdispatch/dfilter.c*, but
+could be anywhere as indicated above.
+````
+#ifdef HAVE_ZSTD
+int
+nc_def_var_zstandard(int ncid, int varid, int level)
+{
+    int stat = NC_NOERR;
+    unsigned ulevel;
+
+    if((stat = nc_inq_filter_avail(ncid,H5Z_FILTER_ZSTD))) goto done;
+    /* Filter is available */
+    /* Level must be between -131072 and 22 on Zstandard v.
1.4.5 (~202009).
+       Earlier versions have fewer levels (especially fewer negative levels) */
+    if (level < -131072 || level > 22)
+        return NC_EINVAL;
+    ulevel = (unsigned) level; /* Keep bit pattern */
+    if((stat = nc_def_var_filter(ncid,varid,H5Z_FILTER_ZSTD,1,&ulevel))) goto done;
+done:
+    return stat;
+}
+
+int
+nc_inq_var_zstandard(int ncid, int varid, int* hasfilterp, int* levelp)
+{
+    int stat = NC_NOERR;
+    size_t nparams;
+    unsigned params = 0;
+    int hasfilter = 0;
+
+    if((stat = nc_inq_filter_avail(ncid,H5Z_FILTER_ZSTD))) goto done;
+    /* Filter is available */
+    /* Get filter info */
+    stat = nc_inq_var_filter_info(ncid,varid,H5Z_FILTER_ZSTD,&nparams,NULL);
+    if(stat == NC_ENOFILTER) {stat = NC_NOERR; hasfilter = 0; goto done;}
+    if(stat != NC_NOERR) goto done;
+    hasfilter = 1;
+    if(nparams != 1) {stat = NC_EFILTER; goto done;}
+    if((stat = nc_inq_var_filter_info(ncid,varid,H5Z_FILTER_ZSTD,&nparams,&params))) goto done;
+done:
+    if(levelp) *levelp = (int)params;
+    if(hasfilterp) *hasfilterp = hasfilter;
+    return stat;
+}
+#endif /*HAVE_ZSTD*/
+````
+
+# Point of Contact {#intern_poc}
+
+*Author*: Dennis Heimbigner
+*Email*: dmh at ucar dot edu
+*Initial Version*: 12/22/2021
+*Last Revised*: 01/25/2022 diff --git a/docs/nczarr.md b/docs/nczarr.md index af10387c60..969a403228 100644 --- a/docs/nczarr.md +++ b/docs/nczarr.md @@ -481,9 +481,9 @@ collections — High-performance dataset datatypes](https://docs.python.org/2/li [7] [XArray Zarr Encoding Specification](http://xarray.pydata.org/en/latest/internals.html#zarr-encoding-specification)
[8] [Dynamic Filter Loading](https://support.hdfgroup.org/HDF5/doc/Advanced/DynamicallyLoadedFilters/HDF5DynamicallyLoadedFilters.pdf)
[9] [Officially Registered Custom HDF5 Filters](https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins)
-[10] [C-Blosc Compressor Implementation](https://github.com/Blosc/c-blosc) -[11] [Conda-forge / packages / aws-sdk-cpp] -(https://anaconda.org/conda-forge/aws-sdk-cpp)
+[10] [C-Blosc Compressor Implementation](https://github.com/Blosc/c-blosc)
+[11] [Conda-forge / packages / aws-sdk-cpp](https://anaconda.org/conda-forge/aws-sdk-cpp)
+[12] [GDAL Zarr](https://gdal.org/drivers/raster/zarr.html)
# Appendix A. Building NCZarr Support {#nczarr_build} @@ -524,8 +524,7 @@ Note also that if S3 support is enabled, then you need to have a C++ compiler in The necessary CMake flags are as follows (with defaults) -1. --DENABLE_NCZARR=off -- equivalent to the Automake _--disable-nczarr_ option. +1. -DENABLE_NCZARR=off -- equivalent to the Automake _--disable-nczarr_ option. 2. -DENABLE_NCZARR_S3=off -- equivalent to the Automake _--enable-nczarr-s3_ option. 3. -DENABLE_NCZARR_S3_TESTS=off -- equivalent to the Automake _--enable-nczarr-s3-tests_ option. @@ -562,7 +561,7 @@ Building this package from scratch has proven to be a formidable task. This appears to be due to dependencies on very specific versions of, for example, openssl. -## **nix** Build +## *\*nix\** Build For linux, the following context works. Of course your mileage may vary. * OS: ubuntu 21 @@ -682,7 +681,7 @@ Some of the relevant limits are as follows: Note that the limit is defined in terms of bytes and not (Unicode) characters. This affects the depth to which groups can be nested because the key encodes the full path name of a group. -# Appendix D. Alternative Mechanisms for Accessing Remote Datasets +# Appendix D. Alternative Mechanisms for Accessing Remote Datasets {#nczarr_altremote} The NetCDF-C library contains an alternate mechanism for accessing traditional netcdf-4 files stored in Amazon S3: The byte-range mechanism. The idea is to treat the remote data as if it was a big file. @@ -706,7 +705,7 @@ Specifically, Thredds servers support such access using the HttpServer access me https://thredds-test.unidata.ucar.edu/thredds/fileServer/irma/metar/files/METAR_20170910_0000.nc#bytes ```` -# Appendix E. AWS Selection Algorithms. +# Appendix E. AWS Selection Algorithms. {#nczarr_awsselect} If byterange support is enabled, the netcdf-c library will parse the files ```` @@ -764,7 +763,7 @@ Picking an access-key/secret-key pair is always determined by the current active profile. 
To choose to not use keys requires that the active profile must be "none".

-# Appendix F. NCZarr Version 1 Meta-Data Representation
+# Appendix F. NCZarr Version 1 Meta-Data Representation. {#nczarr_version1}

 In NCZarr Version 1, the NCZarr specific metadata was represented using new objects rather than as keys in existing Zarr objects.
 Due to conflicts with the Zarr specification, that format is deprecated in favor of the one described above.
@@ -779,6 +778,26 @@ The content of these objects is the same as the contents of the corresponding ke
 * ''.nczarray <=> ''_NCZARR_ARRAY_''
 * ''.nczattr <=> ''_NCZARR_ATTR_''

+# Appendix G. JSON Attribute Convention. {#nczarr_json}
+
+An attribute may be encountered on read whose value, when parsed
+as JSON, is a dictionary. As a special convention, the value is
+converted to a string and stored as the value of the attribute,
+and the type of the attribute is treated as char.
+
+When writing a character valued attribute, its value is examined
+to see if it looks like a JSON dictionary (i.e. "{...}")
+and is parseable as JSON.
+If so, then the attribute value is treated as one long string,
+parsed as JSON, and stored in the .zattr file in JSON form.
+
+These conventions are intended to help support various
+attributes created by other packages where the attribute is a
+complex JSON dictionary. An example is the GDAL Driver
+convention [12]. The value is a complex
+JSON dictionary and it is desirable to both read and write that kind of
+information through the netcdf API.
+
 # Point of Contact {#nczarr_poc}

 __Author__: Dennis Heimbigner