
How should xarray serialize bytes/unicode strings across Python/netCDF versions? #2059

Open
shoyer opened this issue Apr 15, 2018 · 5 comments

@shoyer
Member

shoyer commented Apr 15, 2018

netCDF string types

We have several options for storing strings in netCDF files:

  • NC_CHAR: netCDF's legacy character type. The closest match is NumPy's 'S1' dtype. In principle, it's supposed to be able to store arbitrary bytes. On HDF5, it uses a UTF-8 encoded string with a fixed size of 1 (but note that HDF5 does not complain about storing arbitrary bytes).
  • NC_STRING: netCDF's newer variable length string type. It's only available on netCDF4 (not netCDF3). It corresponds to an HDF5 variable-length string with UTF-8 encoding.
  • NC_CHAR with an _Encoding attribute: xarray and netCDF4-Python support an ad-hoc convention for storing unicode strings in NC_CHAR data-types by adding the attribute {'_Encoding': 'UTF-8'}. The data is still stored as fixed-width characters, but xarray (and netCDF4-Python) can decode them as unicode (see the sketch below).

NC_STRING would seem like a clear win in cases where it's supported, but as @crusaderky points out in #2040, in many cases it actually produces much larger netCDF files than character arrays, which compress more easily. Nonetheless, we currently default to storing unicode strings as NC_STRING, because it's the most portable option -- every tool that handles HDF5 and netCDF4 should be able to read it properly as unicode strings.
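For concreteness, here's what the _Encoding convention from the last bullet looks like with netCDF4-Python directly -- a minimal sketch following the pattern in the netCDF4-Python docs (the file, dimension, and variable names are just illustrative):

```python
import netCDF4
import numpy as np

with netCDF4.Dataset('chars.nc', 'w') as nc:
    nc.createDimension('x', 2)
    nc.createDimension('string3', 3)
    # NC_CHAR on disk: one character per element, string length as a dimension
    v = nc.createVariable('data', 'S1', ('x', 'string3'))
    v._Encoding = 'utf-8'  # opt in to automatic unicode conversion
    v[:] = np.array(['foo', 'bar'], dtype='S3')  # stored as fixed-width chars

with netCDF4.Dataset('chars.nc') as nc:
    print(nc.variables['data'][:])  # decoded back to an array of strings
```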

NumPy/Python string types

On the Python side, our options are perhaps even more confusing:

  • NumPy's dtype=np.string_ corresponds to fixed-length bytes. This is the default dtype for strings on Python 2, because on Python 2 strings are the same as bytes.
  • NumPy's dtype=np.unicode_ corresponds to fixed-length unicode. This is the default dtype for strings on Python 3, because on Python 3 strings are the same as unicode.
  • Strings are also commonly stored in numpy arrays with dtype=np.object_, as arrays of either bytes or unicode objects. This is a pragmatic choice, because otherwise NumPy has no support for variable length strings. We also use this (like pandas) to mark missing values with np.nan.

Like pandas, we are pretty liberal with converting back and forth between fixed-length (np.string_/np.unicode_) and variable-length (object dtype) representations of strings as necessary. This works pretty well, though converting from object arrays in particular has downsides, since it cannot be done lazily with dask.
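A quick sketch of the three representations and of the object-to-fixed-width conversion (illustrative only):

```python
import numpy as np

fixed_bytes = np.array([b'abc'])       # dtype('S3'): fixed-width bytes
fixed_unicode = np.array([u'abc'])     # dtype('<U3'): fixed-width unicode
var_width = np.array([u'abc', np.nan], dtype=object)  # variable width, NaN marks missing

# Going from object back to fixed width requires scanning every element
# to find the longest string, which is why it cannot be done lazily:
print(np.array([u'abc', u'de'], dtype=object).astype(np.unicode_))  # -> dtype('<U3')
```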

Current behavior of xarray

Currently, xarray uses the same behavior on Python 2/3. The priority was faithfully round-tripping data from a particular version of Python to netCDF and back, which the current serialization behavior achieves:

| Python version | NetCDF version | NumPy datatype | NetCDF datatype |
| --- | --- | --- | --- |
| Python 2 | NETCDF3 | np.string_ / str | NC_CHAR |
| Python 2 | NETCDF4 | np.string_ / str | NC_CHAR |
| Python 3 | NETCDF3 | np.string_ / bytes | NC_CHAR |
| Python 3 | NETCDF4 | np.string_ / bytes | NC_CHAR |
| Python 2 | NETCDF3 | np.unicode_ / unicode | NC_CHAR with UTF-8 encoding |
| Python 2 | NETCDF4 | np.unicode_ / unicode | NC_STRING |
| Python 3 | NETCDF3 | np.unicode_ / str | NC_CHAR with UTF-8 encoding |
| Python 3 | NETCDF4 | np.unicode_ / str | NC_STRING |
| Python 2 | NETCDF3 | object bytes/str | NC_CHAR |
| Python 2 | NETCDF4 | object bytes/str | NC_CHAR |
| Python 3 | NETCDF3 | object bytes | NC_CHAR |
| Python 3 | NETCDF4 | object bytes | NC_CHAR |
| Python 2 | NETCDF3 | object unicode | NC_CHAR with UTF-8 encoding |
| Python 2 | NETCDF4 | object unicode | NC_STRING |
| Python 3 | NETCDF3 | object unicode/str | NC_CHAR with UTF-8 encoding |
| Python 3 | NETCDF4 | object unicode/str | NC_STRING |

This can also be selected explicitly for most data-types by setting dtype in encoding (see the example after the list):

  • 'S1' for NC_CHAR (with or without encoding)
  • str for NC_STRING (though I'm not 100% sure it works properly currently when given bytes)
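For example (a minimal sketch; the variable and file names are arbitrary):

```python
import xarray as xr

ds = xr.Dataset({'data': ('x', [u'abc', u'de'])})

# Force fixed-width characters (NC_CHAR) on disk:
ds.to_netcdf('chars.nc', encoding={'data': {'dtype': 'S1'}})

# Force variable-length strings (NC_STRING) on disk; netCDF4 only:
ds.to_netcdf('strings.nc', format='NETCDF4', encoding={'data': {'dtype': str}})
```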

Script for generating table:

```python
from __future__ import print_function
import sys
import uuid

import netCDF4
import numpy as np
import xarray as xr

# Write each NumPy string flavor with each netCDF format, then inspect
# the data-type that actually ends up on disk.
for dtype_name, value in [
    ('np.string_ / ' + type(b'').__name__, np.array([b'abc'])),
    ('np.unicode_ / ' + type(u'').__name__, np.array([u'abc'])),
    ('object bytes/' + type(b'').__name__, np.array([b'abc'], dtype=object)),
    ('object unicode/' + type(u'').__name__, np.array([u'abc'], dtype=object)),
]:
    for format in ['NETCDF3_64BIT', 'NETCDF4']:
        filename = str(uuid.uuid4()) + '.nc'
        xr.Dataset({'data': value}).to_netcdf(filename, format=format)
        with netCDF4.Dataset(filename) as f:
            var = f.variables['data']
            disk_dtype = var.dtype
            has_encoding = hasattr(var, '_Encoding')
        # 'S1' on disk means NC_CHAR; anything else here is NC_STRING.
        disk_dtype_name = (('NC_CHAR' if disk_dtype == 'S1' else 'NC_STRING') +
                           (' with UTF-8 encoding' if has_encoding else ''))
        print('|', 'Python %i' % sys.version_info[0],
              '|', format[:7],
              '|', dtype_name,
              '|', disk_dtype_name,
              '|')
```

Potential alternatives

The main option I'm considering is switching the default to NC_CHAR with UTF-8 encoding for np.string_ / str and object bytes/str on Python 2. The current behavior could still be selected explicitly by setting an encoding of {'_Encoding': None}.

This would imply two changes:

  1. Attempting to serialize arbitrary bytes (on Python 2) would start raising an error -- anything that isn't ASCII would require explicitly disabling _Encoding.
  2. Strings read back from disk on Python 2 would come back as unicode instead of bytes.

This implicit conversion would be consistent with Python 2's general handling of bytes/unicode, and would facilitate reading netCDF files on Python 3 that were written with Python 2.
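To make the proposal concrete, the opt-out might look something like this -- entirely hypothetical, since {'_Encoding': None} is not currently supported:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'data': ('x', np.array([b'\xff\xfe'], dtype=object))})
# Hypothetical opt-out of the proposed UTF-8 default, keeping raw bytes:
ds.to_netcdf('raw.nc', encoding={'data': {'_Encoding': None}})
```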

The counter-argument is that it may not be worth changing this at this late point, given that we will be sunsetting Python 2 support by year's end.

@fmaussion
Member

Thanks a lot, Stephan, for writing this up!

> The counter-argument is that it may not be worth changing this at this late point, given that we will be sunsetting Python 2 support by year's end.

This would be my personal opinion here. If you feel this is something you'd like to provide before the last py2-compatible xarray comes out, then I'm fine with it, but it shouldn't have top priority...

@rhkleijn
Contributor

Currently, the dtype does not seem to round-trip faithfully.
When I write np.unicode_ / str to a file, it comes back as object when I subsequently read it from disk. I am using xarray 0.10.8 with Python 3 on Windows.

This can be reproduced by inserting the following lines into the script above (and adjusting the print statement accordingly)

```python
with xr.open_dataset(filename) as ds:
    read_dtype = ds['data'].dtype
```

which gives:

| Python version | NetCDF version | NumPy datatype | NetCDF datatype | NumPy datatype (read) |
| --- | --- | --- | --- | --- |
| Python 3 | NETCDF3 | np.string_ / bytes | NC_CHAR | \|S3 |
| Python 3 | NETCDF4 | np.string_ / bytes | NC_CHAR | \|S3 |
| Python 3 | NETCDF3 | np.unicode_ / str | NC_CHAR with UTF-8 encoding | object |
| Python 3 | NETCDF4 | np.unicode_ / str | NC_STRING | object |
| Python 3 | NETCDF3 | object bytes/bytes | NC_CHAR | \|S3 |
| Python 3 | NETCDF4 | object bytes/bytes | NC_CHAR | \|S3 |
| Python 3 | NETCDF3 | object unicode/str | NC_CHAR with UTF-8 encoding | object |
| Python 3 | NETCDF4 | object unicode/str | NC_STRING | object |

Also, object bytes/bytes does not seem to round-trip cleanly either, as it comes back converted to np.string_ / bytes.

Is it possible to preserve dtype when persisting xarray Datasets/DataArrays to disk?

@shoyer
Member Author

shoyer commented Aug 12, 2018

> Is it possible to preserve dtype when persisting xarray Datasets/DataArrays to disk?

Unfortunately, there is a frustrating disconnect between string data types in NumPy and netCDF.

This could be done in principle, but it would require adding an xarray-specific convention on top of netCDF. I'm not sure it would be worth it -- we already end up converting np.unicode_ to object dtype in many operations, because we need a string dtype that can support missing values.

For reading data from disk, we use object dtype because we don't know the length of the longest string until we actually read the data, so a fixed-width dtype would be incompatible with lazy loading.
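So converting back to a fixed-width dtype is left to the user once the data is in memory -- a sketch, reusing the hypothetical 'strings.nc' from the earlier example:

```python
import xarray as xr

with xr.open_dataset('strings.nc') as ds:
    data = ds['data'].load()       # force the actual read
    print(data.dtype)              # object: variable-length strings
    print(data.astype(str).dtype)  # longest string now known, e.g. <U3
```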

@NowanIlfideme

This may be relevant here, maybe not, but it appears the HDF5 backend is also at odds with all of the serialization behavior above.

Our internal project's dependencies changed, which moved the h5py version from 2.10 to 3.1; apparently there was a breaking change whereby unicode strings started being read back (or written) as bytes. Thankfully we had a test for that, but figuring out what was wrong was difficult.

Essentially, netCDF4 files that were round-tripped to a BytesIO (via an HDF5 backend) had their unicode strings converted to bytes. I'm not sure whether the problem was on the encoding or the decoding side -- likely decoding, judging by the docs:

https://docs.h5py.org/en/stable/strings.html
https://docs.h5py.org/en/stable/whatsnew/3.0.html#breaking-changes-deprecations

This might require even more special-casing to achieve consistent behavior for xarray users who don't really want to go into backend details (like me 😋).
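For reference, here's the change in plain h5py terms -- a minimal sketch: h5py >= 3 returns string data as bytes by default, and Dataset.asstr() opts back into str (file and dataset names are illustrative):

```python
import h5py

with h5py.File('example.h5', 'w') as f:
    f['data'] = ['abc']  # stored as variable-length UTF-8 strings

with h5py.File('example.h5', 'r') as f:
    print(f['data'][...])          # h5py >= 3: array([b'abc'], dtype=object)
    print(f['data'].asstr()[...])  # decoded back to str: array(['abc'], dtype=object)
```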

@kmuehlbauer
Contributor

@NowanIlfideme The h5py 3 string changes are also tracked in #4570.
