
Nccopy ignores chunk spec when input is netcdf-4 contiguous #725

Closed
Dave-Allured opened this issue Dec 20, 2017 · 20 comments · Fixed by #1087

Dave-Allured (Contributor)

Environment

Linux and Mac, 64-bit
Version tested: nccopy with netcdf-C 4.4.1.1, hdf5-1.10.1

Summary

When using nccopy to convert a netcdf-4 file from contiguous to chunked, the chunk spec on the command line is ignored, and the output file contains invented chunk sizes. IMO, nccopy is not working as advertised in this case.

Remarkably, when the input file is chunked rather than contiguous, the command line chunk spec is respected. (Example not shown.)

Steps to reproduce

Test input file, 5.2 Mbytes: test31.contig.nc.gz

Run this command:

nccopy -d1 -c time/1,lat/180,lon/180 test31.contig.nc test33.chunked.nc

Input file header:

netcdf test31.contig {
dimensions:
	time = 40 ;
	lat = 180 ;
	lon = 180 ;
variables:
	float x(time, lat, lon) ;
		x:_Storage = "contiguous" ;
		x:_Endianness = "little" ;

// global attributes:
		:_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.10.1" ;
		:_SuperblockVersion = 0 ;
		:_IsNetcdf4 = 1 ;
		:_Format = "netCDF-4" ;
}

Expected output chunk sizes:

		x:_ChunkSizes = 1, 180, 180 ;

Actual result with unexpected chunk sizes:

dimensions:
	time = 40 ;
	lat = 180 ;
	lon = 180 ;
variables:
	float x(time, lat, lon) ;
		x:_Storage = "chunked" ;
		x:_ChunkSizes = 20, 90, 90 ;
		x:_DeflateLevel = 1 ;
		x:_Endianness = "little" ;

// global attributes:
		:_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.10.1" ;
		:_SuperblockVersion = 0 ;
		:_IsNetcdf4 = 1 ;
		:_Format = "netCDF-4" ;
}
WardF (Member) commented Dec 20, 2017

Thanks for the report, taking a look now.

WardF self-assigned this Dec 20, 2017
WardF added this to the 4.6.0 milestone Dec 20, 2017
WardF (Member) commented Dec 20, 2017

Reproduced and am observing the same behavior. Trying to narrow down where this is happening. Also seeing this with hdf5libversion=1.8.19 and the master branch.

Dave-Allured (Contributor, Author)

@WardF, thanks for trying an alternate hdf5 version. This seems to me like a relatively simple bug in nccopy code, rather than a deep support library problem.

This might be related to unresolved #391, "Using nccopy, setting deflate level to 0 ignores chunking specification".

You can mark this as low priority as far as I am concerned.

WardF (Member) commented Dec 21, 2017

The alternate HDF5 version was incidental; it's what was on hand in my dev environment, and I made a note of it so that I wouldn't forget. Thanks! Hoping it is as simple as it seems, going to try to knock it out in short order.

WardF modified the milestones: 4.6.0, 4.6.1 Jan 25, 2018
WardF modified the milestones: 4.6.1, 4.7.0 Mar 20, 2018
adrfantini commented Mar 21, 2018

I'm trying to re-chunk some files from 1, 46, 113 chunking into 721, 1, 1, and I also found that -c is not working as intended, with the output chunks being 121, 47, 113, or roughly 1/6 of the variable size in each dimension (which is the usual default).
I'm using version 4.6.3.

DennisHeimbigner (Collaborator) commented Mar 21, 2018

The primary problem is that the code in nccopy.c is incorrect.
At about line 673, it tests for contig==1 to decide whether the output
should be contiguous or chunked. As near as I can tell, the value of contig
is the one for the input file, so if the input file is contiguous, it forces the output file
to also be contiguous.
The mystery is that the output file is actually shown to be chunked, so
somewhere in the netcdf-c library the contiguous flag is being changed
to chunking. It is probably the case that with this change, default chunking
is being used.
OK, the reason that chunking is being forced is that the output file has
deflation set.
So, the fix would be to change the condition at line 673 to use contig
only when no chunking was specified, the input file is
contiguous, and no output compression is set.
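The corrected condition described above can be sketched as a single predicate. This is a hypothetical Python illustration only; the names (input_contiguous, chunkspec_given, deflate_level) are assumptions, not the actual identifiers in nccopy.c.

```python
def output_is_contiguous(input_contiguous: bool,
                         chunkspec_given: bool,
                         deflate_level: int) -> bool:
    """Output may stay contiguous only when the input variable is
    contiguous, no chunk spec was given on the command line, and
    no compression (deflation) is requested for the output."""
    return input_contiguous and not chunkspec_given and deflate_level == 0
```

Under this condition, the original bug report's case (contiguous input, -c given, -d1 given) would correctly fall through to chunked output with the user's chunk spec applied.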

DennisHeimbigner (Collaborator)

Temporary workaround:

  1. Copy the contiguous file with -d1 to force the output file to be chunked, e.g.

nccopy -d1 test31.contig.nc tmp.nc

  2. Since the tmp file is chunked, specifying new chunking will work, so
     copy again to rechunk as desired, e.g.

nccopy -d1 -c time/1,lat/180,lon/180 tmp.nc test33.chunked.nc

adrfantini

In my case the input file is already chunked in time-slices (1, 46, 113), but it still does not seem to want to chunk it in time-series, like I'd like it to.

DennisHeimbigner (Collaborator)

Do you have an example file (that is not too large)?

adrfantini commented Mar 21, 2018

Sure! Take this example. It's a small 5x280x678 file chunked 1, 96, 678 with a few variables. It appears some chunking is done, but not always according to the requested chunkspec.

See the following tests:
nccopy -k4 -d1 -s -u -w -c lat/1,lon/1,time/5 nc.nc nc_rechunked.nc
The output is 3, 280, 678.

nccopy -k4 -d1 -s -u -w -c time/5 nc.nc nc_rechunked.nc
The output is 5, 96, 678.

nccopy -k4 -d1 -s -u -w -c time/5,lat/10 nc.nc nc_rechunked.nc
The output is 5, 10, 678.

nccopy -k4 -d1 -s -u -w -c time/5,lat/10,lon/10 nc.nc nc_rechunked.nc
The output is 3, 280, 678.

nccopy -k4 -d1 -s -u -w -c time/5,lon/10 nc.nc nc_rechunked.nc
The output is 5, 96, 10.

nccopy -k4 -d1 -s -u -w -c time/5,lon/1 nc.nc nc_rechunked.nc
The output is 3, 280, 678.

nccopy -k4 -d1 -s -u -w -c time/5,lon/20,lat/20 nc.nc nc_rechunked.nc
The output is 3, 280, 678.

nccopy -k4 -d1 -s -u -w -c time/5,lon/21,lat/21 nc.nc nc_rechunked.nc
The output is 5, 21, 21. Works.

nccopy -k4 -d1 -s -c time/5,lon/20,lat/20 nc.nc nc_rechunked.nc
The output is 5, 20, 20. Works.

nccopy -k4 -d1 -s -c time/5,lon/10,lat/10 nc.nc nc_rechunked.nc
The output is 5, 10, 10. Works, but this is slower.

nccopy -k4 -d1 -s -c time/5,lon/1,lat/1 nc.nc nc_rechunked.nc
The output is 5, 1, 1. Works, but this is MUCH MUCH slower.

nccopy -c time/5,lon/1,lat/1 nc.nc nc_rechunked.nc
The output is 5, 1, 1. Works, but this is MUCH MUCH slower.

nccopy -u -c time/5,lon/1,lat/1 nc.nc nc_rechunked.nc
The output is 3, 280, 678. Instantaneous.

So it appears -u is doing something strange here.

DennisHeimbigner (Collaborator)

So part of the problem is this.
If the chunk size is too small (<= 512 bytes) and
the variable is not a record variable, then nccopy
refuses to chunk it. However, if compression was specified,
then the underlying C library will force chunking using
default chunking.
If you use the -u flag, then the output variable is no longer
a record variable, hence the chunk size limit comes into play.
If -u is removed, then chunking will occur even though the chunk size
is below nccopy's limit of 512 bytes.
It does not appear that this is properly documented in the nccopy
manual. And it is not currently possible to override that minimum chunk
size.
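The stated minimum-size rule can be sketched as follows. This is a hypothetical Python illustration of the rule as described in this comment (it does not by itself explain the 2000-float case raised in the next comment); the function name and the 512-byte default are taken from the text, everything else is an assumption.

```python
def nccopy_applies_user_chunking(chunk_bytes: int,
                                 is_record_var: bool,
                                 min_chunk_bytes: int = 512) -> bool:
    """nccopy refuses the user's chunk spec for a small, non-record
    variable; record variables are exempt from the minimum. Using -u
    makes variables non-record, which is why it brings the limit
    into play."""
    return is_record_var or chunk_bytes > min_chunk_bytes
```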

adrfantini

I think nccopy should communicate when and why it is overriding the chunking specification.

A 5x20x20 = 2000-float chunk was ignored, while a 5x21x21 = 2205-float one was accepted; how does this relate to the 512-byte limit?

DennisHeimbigner (Collaborator)

I agree, it should report when it ignores the user-specified chunking.
For your examples, was the input file in both cases non-chunked?
Also, the command line arguments affect the decision:

  • -u brings the chunk minimum size into play
  • -d1 forces chunking, but if the input was contiguous, then default chunking
    will be used
  • unlimited dimensions also add complexity

The code in nccopy combined with the code in the netcdf-c library leads to a rather
tortuous set of rules. It clearly needs to be cleaned up.

DennisHeimbigner (Collaborator) commented Mar 25, 2018

Let me propose the following set of rules for applying chunking in nccopy:

  • any-input -> netcdf-3|cdf5
    => chunking is suppressed

  • netcdf-3|cdf5 -> netcdf-4
    • no other factors
      => output will be contiguous [or should it be default-chunk?]

  • netcdf-4 -> netcdf-4
    • in the absence of any other factors
      => transfer input chunking to output chunking

  • any-input -> netcdf-4
    • chunk spec provided
      => output chunk spec always applied
      => warn if chunk size is too small, but apply it anyway

  • any-input -> netcdf-4
    • compression specified and no chunk spec
      => if input is chunked, use that for the output
      => else force default chunking

Notes:

  • netcdf-4 includes netcdf-4 classic
  • add an nccopy flag to force all output variables to be contiguous
    unless overridden by the above rules
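The rule table above can be sketched as a decision function. This is an illustrative Python sketch, not nccopy code; the format labels and result strings are assumptions chosen for readability.

```python
def decide_chunking(out_fmt, in_fmt, in_chunking, chunkspec, compressed):
    """Sketch of the proposed rules. in_chunking is the input variable's
    chunk sizes (or None if contiguous); chunkspec is the command-line
    spec (or None)."""
    if out_fmt in ("netcdf-3", "cdf5"):
        return "suppressed"            # classic output formats cannot chunk
    if chunkspec is not None:
        return "apply-chunkspec"       # an explicit spec is always honored
    if compressed:
        # compression specified, no chunk spec
        return "copy-input-chunking" if in_chunking else "default-chunking"
    if in_fmt == "netcdf-4" and in_chunking:
        return "copy-input-chunking"   # netcdf-4 -> netcdf-4, no other factors
    return "contiguous"                # e.g. netcdf-3 -> netcdf-4, no factors
```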

czender (Contributor) commented Mar 28, 2018

For netCDF4->netCDF4, when no chunking is explicitly specified on the command line, NCO maintains the input chunksizes, if any, in the output file. If compression is specified, then NCO uses the input chunksizes, if any, else the default chunking algorithm is applied. I think this makes the most sense because there are fewer surprises, e.g., input chunksizes are not ignored during compression. The other behaviors proposed above seem good to me.

DennisHeimbigner (Collaborator)

Forgot that case. Edited comment to add.

Dave-Allured (Contributor, Author) commented Mar 29, 2018

Here is an alternate proposal for nccopy chunking rules. I think the command line should be considered first, rather than the input format.

SUMMARY: Preserve all chunking properties from the input file, except when changed on the command line.

These rules apply only when the selected output format supports chunking, i.e. for the netcdf-4 variants. Apply in the following order, independently for each variable to copy:

  1. First apply chunk sizes for each dimension explicitly specified on the command line, regardless of input format or input properties.

  2. For dimensions not named on the command line, preserve chunk sizes from the input variable. Do this independently for each dimension on each variable.

  3. If an input variable is netcdf-4 contiguous, none of its dimensions are named on the command line, and chunking is not mandated by other options, then make a contiguous output variable.

  4. Optional. Consider a small, fixed-dimension, non-chunked input variable such as netcdf-3, as a special case. When none of its dimensions are named on the command line, and chunking is not mandated by other options, then make a contiguous output variable.

  5. Handle all remaining cases when some or all chunk sizes are not determined by the command line or the input variable. This includes the non-chunked input cases such as netcdf-3, cdf5, and DAP. In these cases:

    1. Retain all chunk sizes determined by (1) and (2); and
    2. Compute the remaining chunk sizes automatically, with some reasonable algorithm.
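Steps (1), (2), and (5) amount to a per-dimension merge, which can be sketched as follows. This is an illustrative Python sketch; the default argument stands in for the "reasonable algorithm" of step 5.2 and is an assumption, not a real nccopy heuristic.

```python
def merge_chunk_sizes(dims, cmdline, input_chunks, default):
    """Per-dimension merge: a command-line size wins, then the input
    variable's chunk size for that dimension, then a computed default.
    dims is a list of (name, length); cmdline maps dim name -> size;
    input_chunks is the input variable's chunk tuple or None."""
    out = []
    for i, (name, length) in enumerate(dims):
        if name in cmdline:
            out.append(cmdline[name])          # rule 1: command line
        elif input_chunks is not None:
            out.append(input_chunks[i])        # rule 2: preserve input
        else:
            out.append(default(length))        # rule 5.2: computed default
    return out
```

For example, with the original report's dimensions and a chunked input of 20, 90, 90, specifying only time/1 would preserve the input's 90, 90 for lat and lon.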

czender (Contributor) commented Mar 29, 2018

FYI that's pretty much what NCO does, AFAICT without re-reading the code.

adrfantini commented Mar 29, 2018

The above proposals all make sense to me. What's most important is that this be well documented in the manual, and that if any option passed by the user is ignored, there be a warning message specifying why.

DennisHeimbigner (Collaborator)

Belatedly, it has occurred to me that part of the problem is that in the netcdf-C library
chunking is associated with variables, not dimensions. Since nccopy associates chunking
with dimensions, this forces the use of a somewhat arbitrary algorithm to convert from dimension-based chunking to variable-based chunking.
So, I think Dave's algorithm is as good as any and is pretty close to the current nccopy algorithm. I may also consider adding a flag to support variable-based chunking to nccopy.

DennisHeimbigner added a commit that referenced this issue Jul 27, 2018
After a long discussion, I implemented the rules at the end of that issue.
They are documented in nccopy.1.

Additionally, I added a new, per-variable, -c flag that allows
for the direct setting of the chunking parameters for a variable.
The form is
    -c var:c1,c2,...ck
where var is the name of the variable (possibly a fully qualified name)
and the ci are the chunksizes for that variable. It must be the case
that the rank of the variable is k. If the new form is used as well
as the old form, then the new form overrides the old form for the
specified variable. Note that multiple occurrences of the new form
-c flag may be specified.
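The two -c forms can be distinguished by the presence of a colon. The following is a hypothetical Python sketch of such a parser, not the actual nccopy implementation; real nccopy error handling may differ.

```python
def parse_chunk_arg(arg):
    """Distinguish the new per-variable form 'var:c1,c2,...,ck' from the
    old per-dimension form 'dim1/n1,dim2/n2,...'. Returns either
    ('var', name, sizes) or ('dims', {dim: size}). Checking for ':'
    first also handles fully qualified names like '/g/var:1,2'."""
    if ":" in arg:
        name, sizes = arg.split(":", 1)
        return ("var", name, [int(s) for s in sizes.split(",")])
    pairs = {}
    for item in arg.split(","):
        dim, n = item.split("/")
        pairs[dim] = int(n)
    return ("dims", pairs)
```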

Misc. Other fixes
1. Added -M <size> option to nccopy to specify the minimum
   allowable chunksize.
2. Removed the unused variables from bigmeta.c
   (Issue #1079)
3. Fixed failure of nc_test4/tst_filter.sh by using the new -M
   flag (#1) to allow filter test on a small chunk size.