
Separate internal netCDF-4 data model from HDF5 code #856

Closed
edhartnett opened this issue Feb 9, 2018 · 16 comments

@edhartnett
Contributor

As discussed in PR #849, it will soon be possible to separate the internal netCDF-4 data model code (i.e. the lists of atts, vars, and dims, and the functions that manipulate them) from the HDF5 code. I am adding this issue so that there is a place to discuss the worthiness and possible implementation of this goal.

This would enable user-defined formats (#177) and the HDF4 layer to work without HDF5. It would also be a useful simplification of the code which would make the libsrc4 code smaller and easier to maintain.

After this ticket has been accomplished several other opportunities present themselves:

  • DAP has its own internal data model (I believe), and at some point it could be simplified by using the common data model.

  • The classic code in libsrc uses a separate but related implementation of the data model. This too could eventually be removed, so that all parts of the library were using the same internal data structs.

@DennisHeimbigner
Collaborator

The approach I was planning to take was to take each netcdf-4 structure
(e.g. NC_VAR_INFO_T etc.) and separate the fields into two parts:

  1. implementation independent
  2. implementation dependent

The implementation-dependent fields I would then separate off into their
own structure. I would then leave an opaque (void*) pointer in the
implementation-independent part. This mirrors how the NC structure has an
opaque pointer called ncdispatchdata that points to the dispatch-specific
info. Alternatively, we could keep a hash table that maps each
implementation-specific object to the corresponding
implementation-independent metadata object.
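A minimal sketch of the split described above (all names here are illustrative placeholders, not the real netcdf-c declarations): the implementation-independent struct carries a void* that points at a per-format struct, just as NC carries an opaque dispatch-data pointer.

```c
#include <stdlib.h>

/* Implementation-independent part, shared by all formats
 * (illustrative; the real struct is NC_VAR_INFO_T). */
typedef struct VAR_META
{
   char *name;
   int ndims;
   void *format_data;   /* opaque pointer to per-format fields */
} VAR_META_T;

/* HDF5-specific part, compiled only in the HDF5 layer
 * (illustrative field names). */
typedef struct VAR_HDF5
{
   long hdf_datasetid;  /* placeholder for the real hid_t */
} VAR_HDF5_T;
```

The HDF5 layer would allocate a VAR_HDF5_T at variable creation and hang it off format_data; all other code would ignore the pointer.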

@edhartnett
Contributor Author

The NC_VAR_INFO_T can already represent all data in the classic and enhanced models. So I would suggest that be the universal understanding (internally) of what a var is, and functions like nc_inq_var*() are already written to work off it. So I don't think anything should be taken out.

Then each user-defined format must jam its data into that model. To the extent they can do so, they take advantage of all the existing inq code that already works well.

Some parts of the model will not be used (for example, NC_CLASSIC_MODEL files cannot have user-defined types), but the code is already written to handle that. If no user-defined types are present, the list can be NULL and that's fine. The current internal data model can handle all variants of the classic and enhanced netCDF models.

I would like user-defined formats to have access to this existing model. I don't think there are going to be many user-defined formats that have a more complex model than HDF5, so I'm pretty sure they will all be able to easily fit within the existing netCDF-4 model.

In many cases, read-only access is fine, so they can then implement a dispatch layer with just a few functions, and take advantage of all the functions that already work off the existing data model.

I don't object to giving them an additional void pointer that they can use within their dispatch layer code for implementation dependent data, but I would suggest that the current netCDF enhanced data model be fully supported in the internal data structures, so that functions written to it can continue to work for many different formats, just like the inq functions currently work for both HDF4 and HDF5.
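The read-only pattern described above can be sketched generically: an open routine parses the file and fills the shared metadata structures once, after which generic inquiry code answers all queries without touching the format again. Everything here (struct layouts, function names) is illustrative, not the actual netcdf-c API.

```c
#include <stdlib.h>
#include <string.h>

/* Minimal stand-ins for the shared metadata structs. */
typedef struct VAR_INFO { char name[64]; int ndims; } VAR_INFO_T;
typedef struct FILE_INFO { VAR_INFO_T **vars; int nvars; } FILE_INFO_T;

/* Hypothetical read-only format "open": populate metadata up front. */
static FILE_INFO_T *myformat_open(void)
{
   FILE_INFO_T *f = calloc(1, sizeof(FILE_INFO_T));
   f->vars = malloc(sizeof(VAR_INFO_T *));
   f->vars[0] = calloc(1, sizeof(VAR_INFO_T));
   strcpy(f->vars[0]->name, "temperature");
   f->vars[0]->ndims = 2;
   f->nvars = 1;
   return f;
}

/* Generic inquiry works off the shared metadata only; a
 * read-only dispatch layer gets this behavior for free. */
static int generic_inq_varname(FILE_INFO_T *f, int varid, char *name)
{
   if (varid < 0 || varid >= f->nvars) return -1;
   strcpy(name, f->vars[varid]->name);
   return 0;
}
```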

Does that fit what you were thinking?

@edhartnett
Contributor Author

OK, I think I have a better idea of what you mean. Just so we have an example:

/* This is a struct to handle the var metadata. */
typedef struct NC_VAR_INFO
{
   char *name;
   char *hdf5_name; /* used if different from name */
   int ndims;
   int *dimids;
   NC_DIM_INFO_T **dim;
   int varid;
   int natts;
   uint32_t hash;
   nc_bool_t is_new_var;        /* True if variable is newly created */
   nc_bool_t was_coord_var;     /* True if variable was a coordinate var, but either the dim or var has been renamed */
   nc_bool_t became_coord_var;  /* True if variable _became_ a coordinate var, because either the dim or var has been renamed */
   nc_bool_t fill_val_changed;  /* True if variable's fill value changes after it has been created */
   nc_bool_t attr_dirty;        /* True if variable's attributes are dirty and should be rewritten */
   nc_bool_t created;           /* Variable has already been created (_not_ that it was just created) */
   nc_bool_t written_to;        /* True if variable has data written to it */
   struct NC_TYPE_INFO *type_info;
   hid_t hdf_datasetid;
   NC_ATT_INFO_T *att;
   nc_bool_t no_fill;           /* True if no fill value is defined for var */
   void *fill_value;
   size_t *chunksizes;
   nc_bool_t contiguous;        /* True if variable is stored contiguously in HDF5 file */
   int parallel_access;         /* Type of parallel access for I/O on variable (collective or independent) */
   nc_bool_t dimscale;          /* True if var is a dimscale */
   nc_bool_t *dimscale_attached;        /* Array of flags that are true if dimscale is attached for that dim index */
   HDF5_OBJID_T *dimscale_hdf5_objids;
   nc_bool_t deflate;           /* True if var has deflate filter applied */
   int deflate_level;
   nc_bool_t shuffle;           /* True if var has shuffle filter applied */
   nc_bool_t fletcher32;        /* True if var has fletcher32 filter applied */
   size_t chunk_cache_size, chunk_cache_nelems;
   float chunk_cache_preemption;
#ifdef USE_HDF4
   /* Stuff below is for hdf4 files. */
   int sdsid;
   int hdf4_data_type;
#endif /* USE_HDF4 */
   /* Stuff for arbitrary filters */
   unsigned int filterid;
   size_t nparams;
   unsigned int* params;
   /* Stuff for diskless data files. */
   void *diskless_data;
} NC_VAR_INFO_T;

You mean there would be a void pointer (format_infop or something). In the case of HDF5 it would contain such things as the datasetid, dimscale_hdf5_objids, and any other values that have to do with HDF5 specifics.

Things like name, ndims, dim, etc., would remain as part of the netCDF internal data that is independent of format.

@DennisHeimbigner
Collaborator

Remember that the NC_VAR_INFO structure has some hdf5
specific fields like hdf5_name, native_hdf5_typeid.

@edhartnett
Contributor Author

Yes, I get it, and I agree. A void pointer to the format-dependent data elements would work well. It would clean up the data structures very nicely.

@DennisHeimbigner
Collaborator

A couple of other thoughts.

  1. Are chunking and filters considered part of the netcdf4 model or are they hdf5 specific?
  2. Other formats will probably have special features (like hdf5 has chunking) that
    it might be desirable to make visible in some generic form through the netcdf API.
    PIO probably has such features, for example.

@cofinoa

cofinoa commented Feb 13, 2018

@DennisHeimbigner and @edhartnett here's my 2 cents worth.

IMO, chunking and filters are part of the storage layer, not of the netCDF-4 data model layer. netCDF-3, DAP2, and DAP4 are other examples of different dispatchers (storage layers).

Yes, a special API for storage layer (implementations) features would be desirable.

@edhartnett
Contributor Author

We can certainly separate the chunking/filter code so that it is in the HDF5 (libsrc4) directory, instead of the new internals directory (libint). We can put chunking/filter values into the proposed HDF5 var data info (let's call it NC_HDF5_VAR_INFO_T).

Recall that chunking and filters are also part of the API. So they will always be part of the libsrc4 directory for that reason. However, we can store the results in NC_HDF5_VAR_INFO_T instead.

But chunking is also used by HDF4, an addition since my departure from the project, I believe. So chunking at least is general to more than one binary format. So maybe chunksizes should stay in NC_VAR_INFO_T, while filter info and other HDF5-only values would move into NC_HDF5_VAR_INFO_T, and NC_VAR_INFO_T would contain a void pointer to the format-specific struct.
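One way the split above could look in code (all names here are hypothetical sketches, not the real netcdf-c structs): chunksizes stays in the shared struct because HDF4 uses chunking too, while the deflate/shuffle filter settings move behind the opaque pointer, reached through a small cast helper inside the HDF5 layer.

```c
#include <stdlib.h>

/* The shared struct keeps chunking, which HDF4 and HDF5 both use. */
typedef struct VAR_COMMON
{
   size_t *chunksizes;
   void *format_data;        /* NULL, or a per-format struct */
} VAR_COMMON_T;

/* HDF5-only filter settings (illustrative). */
typedef struct VAR_HDF5_FILTERS
{
   int deflate_level;
   int shuffle;
} VAR_HDF5_FILTERS_T;

/* Hypothetical cast helper, used inside the HDF5 layer only. */
static VAR_HDF5_FILTERS_T *hdf5_filters(VAR_COMMON_T *var)
{
   return (VAR_HDF5_FILTERS_T *)var->format_data;
}
```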

@DennisHeimbigner
Collaborator

NC_HDF5_VAR_INFO_T
Surely we can find a shorter set of names :-)

@DennisHeimbigner
Collaborator

Before this goes much further, we need to have
a software architecture and design document to
codify the API and the critical implementation
details. Something in markdown like a cross between
this https://github.com/Unidata/netcdf-c/blob/master/docs/filters.md
and this https://support.hdfgroup.org/HDF5/doc/Advanced/DynamicallyLoadedFilters/HDF5DynamicallyLoadedFilters.pdf

@edhartnett
Contributor Author

What, you just reuse the same comment on three threads? ;-)

First I will get a working implementation, then I will document it for you. It's easier that way. ;-)

@edhartnett
Contributor Author

I have a pretty good prototype of these changes, which I still need to work on to make leak-free - some memory leaks have crept in.

But the general shape of the solution I propose is this:

  • Create new directory libhdf5, which will contain all HDF5 code.
  • libsrc4 remains the home of the data model code, but no longer includes HDF5 header files or any HDF5 calls.
  • nc4file.c and nc4internal.c have many functions that will move to libhdf5, along with some from nc4var.c and a scattering of functions from elsewhere in libsrc4. This is mostly moving and renaming functions; not much code inside any function changes.
  • The build system is of course modified to handle libhdf5.
  • The only real re-coding has to do with shutting down a file. Currently we do a recursive descent through the file, and close all HDF5 objects and free all memory. In my prototype, there are now two recursive descents: the first closes HDF5 objects (and happens in libhdf5); the second frees memory and happens in libsrc4. (Only the second need be used by the HDF4 and other read-only layers.)
  • As discussed, void pointers in all the INFO structs to hold format-specific data.
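The two-pass close described in the list above can be sketched as follows. Types and function names are illustrative: in netcdf-c the first pass would make the H5Dclose/H5Gclose calls and live in libhdf5, while the second pass lives in libsrc4.

```c
#include <stdlib.h>

/* Illustrative group node; the real NC_GRP_INFO_T is much richer. */
typedef struct GRP
{
   struct GRP **children;
   int nchildren;
   long hdf_grpid;      /* placeholder for an open HDF5 handle */
} GRP_T;

/* Pass 1 (libhdf5): recursively close format objects only. */
static void close_format_objects(GRP_T *grp)
{
   for (int i = 0; i < grp->nchildren; i++)
      close_format_objects(grp->children[i]);
   /* H5Gclose(grp->hdf_grpid) would happen here. */
   grp->hdf_grpid = -1;
}

/* Pass 2 (libsrc4): recursively free the metadata. Read-only
 * layers such as HDF4 need only this pass. */
static void free_metadata(GRP_T *grp)
{
   for (int i = 0; i < grp->nchildren; i++)
      free_metadata(grp->children[i]);
   free(grp->children);
   free(grp);
}
```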

In terms of documentation, I will add an (internal) documentation section about each directory (libsrc4 and libhdf5), describing what is in each and how they work with the dispatch code. Of course I have also modified existing documentation along with the code, so all the internal documentation reflects the changes.

My next steps will be:

  • Fix memory issues and any other issues in my prototype.
  • Prepare documentation as discussed above, and present it here for review.
  • Integrate hashmap changes once they have been merged to master.
  • Present PR of merged code for review.

@edhartnett
Contributor Author

Here's some new documentation describing the libsrc4 directory, after HDF5 is removed:
/*! \file
Documentation for the NetCDF-4 internal metadata code

\internal

\defgroup metadata_code The Internal NetCDF-4 Metadata Code

The internal netCDF-4 metadata code resides in subdirectory
libsrc4. When netCDF is built with --enable-netcdf-4 (the default) the
netCDF-4 metadata code is used to manage the file metadata.

\section Dispatch Layer use of Metadata Code

Each dispatch layer may make use of the metadata code to manage the
file metadata. The metadata code can handle all nc_inq_* functions,
based on the internal metadata code.

This allows a format reading layer to be easily implemented. On
nc_open() the file is opened and all metadata are read into the
internal netCDF metadata structures. All read-only metadata functions
can then be handled by the internal metadata code.

\section Isolation of Metadata Code from Binary Formats

The libsrc4 code does not and should not include header files or types
from any of the binary formats which use special headers and formats
(HDF4, HDF5, pnetcdf).

This allows the netCDF library to be built with whatever third-party
binary formats the installation machine supports. The HDF5 library is
not required to build the libsrc4 directory.

\section Functions

The functions which comprise the internal metadata API are defined in
include/nc4internal.h. They include all the functions to add, delete,
read, and write the arrays of structs which hold the metadata for an
open file. These structs include:

  • NC_HDF5_FILE_INFO_T
  • NC_GRP_INFO_T
  • NC_VAR_INFO_T
  • NC_TYPE_INFO_T
  • NC_VAR_ARRAY_T
  • NC_ENUM_MEMBER_INFO_T
  • NC_FIELD_INFO_T

The handling of internal metadata is necessarily complex, and there
are many functions to search and manipulate lists of the structs which
comprise the metadata model. Programmers developing a dispatch layer
should look to the existing layers for examples.

\section Enhanced Model

The netCDF enhanced model was specifically designed to offer (almost)
complete compatibility with existing HDF5 files. As a result, the
enhanced model contains many elements, such as user-defined types,
which are unlikely to be found in formats other than HDF5.

However, the enhanced model is fully supported by the metadata layer,
just as the rest of the model is. Some other formats use elements of
the enhanced model (ex. HDF4 uses chunking). Other format interfaces
could be extended to make full use of the enhanced model (ex. pnetcdf
supports internal caches, so they should be settable with our file and
variable cache functions).

*/

@edhartnett
Contributor Author

I have this working now in HPC NetCDF.

It was harder than I had anticipated, with many code changes. However, it does work well and separates out the HDF5 code nicely from the metadata-handling code in libsrc4.

I will put up a PR today or tomorrow with the changes.

@edhartnett
Contributor Author

An update: much of this work is complete and merged. For example, the HDF5 and HDF4 specific fields have all been removed from the structs in nc4internal.h.

The remaining issues have to do with the following functions:

../liblib/.libs/libnetcdf.so: undefined reference to `NC_findreserved'
../liblib/.libs/libnetcdf.so: undefined reference to `NC4_buildpropinfo'
../liblib/.libs/libnetcdf.so: undefined reference to `NC4_provenance_init'
../liblib/.libs/libnetcdf.so: undefined reference to `NC4_isnetcdf4'

So @DennisHeimbigner I wonder whether provenance is supposed to be just for netCDF-4/HDF5 files? That's what it seems like in the code.

Also it seems that the special atts dealt with by nc4_get_att_special() are for netCDF-4/HDF5 files only. But I saw @WardF made a comment that implied you were going to make these special atts available for classic files as well.

The problem is that nc4_get_att_special() is in libsrc4, but calls NC4_buildpropinfo(), which is in libhdf5/nc4info.c.

If these functions are completely free of HDF5, they can be moved to libsrc4. If not, they all can be moved to libhdf5. I would like to leave most of the get_att functionality in libsrc4 (it is used by HDF4 as well), but it looks like I need to move the handling of the special atts to libhdf5.

@edhartnett
Contributor Author

This work is complete and merged to master. I will close this ticket. Thanks Ward and Dennis for your patience. ;-)
