
to_netcdf() to automatically switch to fixed-length strings for compressed variables #2040

Closed
crusaderky opened this issue Apr 5, 2018 · 4 comments


@crusaderky
Contributor

When a dataset contains fixed-length numpy arrays of unicode characters (<U...) and you invoke to_netcdf() without any particular encoding, they are automatically stored as variable-length strings, unless you explicitly specify {'dtype': 'S1'}.

Is this in order to save disk space in case string lengths vary wildly? If so, I can see the point.
However, this approach is disastrous when variables are compressed, as any compression algorithm will reduce the zero-padding at the end of the fixed-length strings to a negligible size.
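As a quick way to verify how a string variable actually ended up on disk, one can inspect the stored dtype directly. A minimal sketch using netCDF4-python ('some_string_var' is a placeholder name):

import netCDF4

# Variable-length string variables report the Python type `str` as their dtype,
# while fixed-length char arrays written with {'dtype': 'S1'} report a numpy S1 dtype.
with netCDF4.Dataset('uncompressed.nc') as nc:
    print(nc.variables['some_string_var'].dtype)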

My test data: a dataset with ~50 variables, of which half are strings of 10 to 100 English characters and the other half are floats, all on a single dimension with 12k points.

Test 1:

ds.to_netcdf('uncompressed.nc')

Result: 45MB

Test 2:

encoding = {k: {'zlib': True, 'shuffle': True} for k in ds.variables}
ds.to_netcdf('bad-compression.nc', encoding=encoding)

Result: 42MB

Test 3:

encoding = {}
for k, v in ds.variables.items():
    encoding[k] = {'zlib': True, 'shuffle': True}
    if v.dtype.kind == 'U':
        encoding[k]['dtype'] = 'S1'
ds.to_netcdf('good-compression.nc', encoding=encoding)

Result: 5MB

Proposal

For string variables, if no dtype is explicitly defined, to_netcdf() should pick it dynamically: S1 when compression is enabled, str when it is disabled.
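Until something like this lands, the same behaviour is easy to emulate on the user side. A minimal sketch (build_encoding is a hypothetical helper, not an xarray API):

def build_encoding(ds, compress=True):
    # Hypothetical helper: force fixed-length 'S1' storage for unicode variables
    # whenever compression is requested, mirroring Test 3 above.
    encoding = {}
    for name, var in ds.variables.items():
        enc = {}
        if compress:
            enc.update({'zlib': True, 'shuffle': True})
            if var.dtype.kind == 'U':
                enc['dtype'] = 'S1'
        encoding[name] = enc
    return encoding

ds.to_netcdf('compressed.nc', encoding=build_encoding(ds))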

@shoyer
Member

shoyer commented Apr 6, 2018

The main reason for preferring variable length strings was that netCDF4-python always properly decoded them as unicode strings, even on Python 3. Basically, it was required to properly round-trip strings to a netCDF file on Python 3.

However, this is no longer the case, now that we specify an encoding when writing fixed-length strings (#1648). So we could potentially revisit the default behavior.

I'll admit I'm also a little surprised by how large the storage overhead turns out to be for variable length datatypes. The HDF5 docs claim it's 32 bytes per element, which would be about 10 MB or so for your dataset. And apparently it interacts poorly with compression, too.
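As a rough sanity check on that figure (assuming ~25 string variables from the dataset described above):

# ~25 string variables x 12,000 elements x 32 bytes of per-element overhead
print(25 * 12_000 * 32 / 1e6)  # ~9.6 MB, in line with "about 10 MB or so"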

@shoyer
Copy link
Member

shoyer commented Apr 7, 2018

One potential option would be to choose the default behavior based on the string data type:

  • Fixed-width unicode arrays (np.unicode_) get written as fixed-width strings with a stored encoding.
  • Object arrays full of Python strings (np.object_) get written as variable width strings.

Note that fixed-width unicode in NumPy (fixed number of unicode characters) does not correspond to the same memory layout as fixed width strings in HDF5 (fixed length in bytes), but maybe it's close enough.
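To make the mismatch concrete (a small aside): NumPy's fixed-width unicode stores four bytes per character (UCS-4), whereas a fixed-width byte string stores one byte per character:

import numpy as np

print(np.array(['abc'], dtype='<U10').itemsize)   # 40 bytes per element
print(np.array([b'abc'], dtype='S10').itemsize)   # 10 bytes per element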

The main reason why we don't do any special handling for object arrays currently in xarray is that our conventions coding/decoding system has no way of marking variable-length string arrays. We should probably handle this by introducing a custom dtype, as h5py does, that marks variable-length strings using dtype metadata: http://docs.h5py.org/en/latest/special.html#variable-length-strings
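For reference, a minimal sketch of the h5py convention linked above (assuming h5py is installed; newer h5py versions also expose h5py.string_dtype()):

import h5py

# h5py marks variable-length strings by attaching metadata to a plain object dtype.
dt = h5py.special_dtype(vlen=str)
print(dt.kind)                    # 'O' -- still an ordinary object dtype to NumPy
print(h5py.check_dtype(vlen=dt))  # <class 'str'> -- the marker lives in the dtype metadata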

@max-sixty
Collaborator

Trying to keep us below 1K issues — is this still current?

@crusaderky
Contributor Author

Trying to keep us below 1K issues — is this still current?

Yes and no.

I could reproduce the issue with today's stack (mamba create -n test python=3.12 xarray h5netcdf).
The compressed version with the manually-set dtype is substantially smaller than the default one. However, the default one retains the original dtype on a round-trip, and I feel that an unchanged round-trip is more important than compression optimization.

So I will close this issue.

import numpy as np
import xarray

LENGTH = 100
NO_CODES = 100_000

# Build NO_CODES random ASCII strings of 10 to LENGTH characters each.
alphabet = list('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
np_alphabet = np.array(alphabet, dtype="|S1")
np_codes = np.random.choice(np_alphabet, [NO_CODES, LENGTH])

rows = []
for row in np_codes:
    row = row[:np.random.randint(10, LENGTH + 1)]
    row = b''.join(row.tolist()).decode("ascii")
    rows.append(row)
rows = np.array(rows)  # fixed-width unicode array, dtype <U100
ds = xarray.Dataset({"x": rows})

ds.to_netcdf('uncompressed.nc', engine='h5netcdf')

encoding = {'x': {'zlib': True, 'shuffle': True}}
ds.to_netcdf('bad-compression.nc', engine='h5netcdf', encoding=encoding)

encoding = {'x': {'zlib': True, 'shuffle': True, 'dtype': 'S1'}}
ds.to_netcdf('good-compression.nc', engine='h5netcdf', encoding=encoding)

!ls -lh *.nc
# -rw-rw-r-- 1 crusaderky crusaderky 7.6M Jun 23 22:04 bad-compression.nc
# -rw-rw-r-- 1 crusaderky crusaderky 4.6M Jun 23 22:04 good-compression.nc
# -rw-rw-r-- 1 crusaderky crusaderky 8.7M Jun 23 22:04 uncompressed.nc
print(ds.x.dtype)  # <U100
print(xarray.open_dataset("uncompressed.nc", engine='h5netcdf').x.dtype)  # <U100
print(xarray.open_dataset("bad-compression.nc", engine='h5netcdf').x.dtype)  # <U100
print(xarray.open_dataset("good-compression.nc", engine='h5netcdf').x.dtype)  # object
