Creating a virtual zarr datacube from COG s3 objects #125

Closed
rsignell-usgs opened this issue Apr 5, 2022 · 53 comments
Labels: enhancement, question

@rsignell-usgs commented Apr 5, 2022

@cgohlke, as mentioned at fsspec/kerchunk#78 (comment), I'd like to try reading a time stack of geotiffs via the Zarr library.

As a test, I've got 2 COGs on S3-API-accessible object storage for which I'd like to create a virtual Zarr dataset using the tifffile library.

I can get it going if the COGs are local, but I'm a little unsure how to handle the s3 objects.

I'm trying to do:

from tifffile import TiffSequence
import zarr
import fsspec

fs = fsspec.filesystem('s3', anon=True, client_kwargs={'endpoint_url': 'https://mghp.osn.xsede.org'})

flist = fs.ls('/rsignellbucket1/lcmap/cog')

s3_flist = [f's3://{f}' for f in flist]

# try with s3 urls:
image_sequence = TiffSequence(s3_flist, pattern=r'.*_(\d+)_V12_LCPRI.tif')
with image_sequence.aszarr() as store:
    dz = zarr.open(store, mode='r')

# try with filelike objects:
file_like_object_list = [fs.open(f) for f in flist]
image_sequence = TiffSequence(file_like_object_list, pattern=r'.*_(\d+)_V12_LCPRI.tif')

but neither approach works. These buckets are public access (no egress charges!), so this code should be reproducible.

Here's the Jupyter Notebook demonstrating the local success (except I'd like to use the native 2048x2048 chunking in the COG instead of the entire file!) and remote s3 failures.

What is the best way to approach this?

rsignell-usgs changed the title from "Creating a virtual zarr datacube from s3 objects" to "Creating a virtual zarr datacube from COG s3 objects" on Apr 5, 2022
@martindurant

File ~/miniconda3/envs/pangeo/lib/python3.9/site-packages/tifffile/tifffile.py:10522, in FileHandle.open(self)
  10520 self._file = os.path.realpath(self._file)
  10521 self._dir, self._name = os.path.split(self._file)
> 10522 self._fh = open(self._file, self._mode)
  10523 self._close = True
  10524 if self._offset is None:

seems to be calling builtin open() explicitly, not allowing fsspec. I suppose you could open with simplecache:: to temporarily store local copies.
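For example, something like the following might work (an untested sketch; the object key here is made up):

import fsspec
import tifffile

# "simplecache::" makes fsspec download the object to a local temporary
# file first, sidestepping tifffile's use of the builtin open()
of = fsspec.open(
    'simplecache::s3://rsignellbucket1/lcmap/cog/example.tif',  # hypothetical key
    s3={'anon': True, 'client_kwargs': {'endpoint_url': 'https://mghp.osn.xsede.org'}},
)
with of.open() as fh, tifffile.TiffFile(fh) as tif:
    print(tif.series[0].shape)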

cgohlke added the enhancement and question labels on Apr 5, 2022
@cgohlke (Owner) commented Apr 5, 2022

To work around the immediate issue:

with TiffFile(fs.open(s3_flist[0])) as tif:
    shape = tif.series[0].shape
    dtype = tif.series[0].dtype

...

with image_sequence.aszarr(chunkshape=shape, dtype=dtype) as store:
    store.write_fsspec(
        'issue125.json',
        'https://mghp.osn.xsede.org/rsignellbucket1/lcmap/cog',
        codec_id='tifffile',
    )

Except for multi-file OME-TIFF series, tifffile does not currently support multiscales or chunks within file series. I am waiting for zarr to standardize "sharding" before implementing this.

Try to export the reference files for each TIFF file as follows and then edit+merge those files. Should be relatively straightforward for this case of a linear file sequence without higher dimensions or tiling:

import json
import fsspec
import zarr
import tifffile

fs = fsspec.filesystem(
    's3',
    anon=True,
    client_kwargs={'endpoint_url': 'https://mghp.osn.xsede.org'},
)
s3_flist = [f's3://{f}' for f in fs.ls('/rsignellbucket1/lcmap/cog')]

for filename in s3_flist:
    url, name = filename.rsplit('/', 1)
    with tifffile.TiffFile(fs.open(filename), name=name) as tif:
        with tif.series[0].aszarr() as store:
            store.write_fsspec(name + '.json', url=url)

# merge the per-file reference sets into a single one, prepending a new
# leading dimension of length len(s3_flist)
aggregate = {}
for i, filename in enumerate(sorted(s3_flist)):
    url, name = filename.rsplit('/', 1)
    with open(name + '.json') as fh:
        data = fh.read()
    for key, value in json.loads(data).items():
        if key in aggregate:
            continue
        if key.endswith('.zarray'):
            # extend the array metadata with the new file dimension
            value = json.loads(value)
            value['chunks'].insert(0, 1)
            value['shape'].insert(0, len(s3_flist))
            value = json.dumps(value)
        elif key[-1].isdigit():
            # prefix chunk keys with this file's index along the new axis
            key = key.replace('/', f'/{i}.')
        aggregate[key] = value

with open('aggregate.json', 'w') as fh:
    fh.write(json.dumps(aggregate, indent=1))

mapper = fsspec.get_mapper(
    'reference://',
    fo='aggregate.json',
    target_protocol='file',
    remote_protocol='s3',
)

za = zarr.open(mapper, mode='r')
print(za[0])

Alternatively, it might be easier to create the reference file directly from the metadata in the TIFF files without going through a zarr store. This prints all the necessary metadata:

for filename in sorted(s3_flist):
    with tifffile.TiffFile(fs.open(filename)) as tif:
        for level in tif.series[0].levels:
            print(level.dtype)
            print(level.shape)
            print(level.keyframe.chunks)
            print(level.keyframe.compression)
            for page in level.pages:
                for offset, bytecount in zip(
                    page.dataoffsets, page.databytecounts
                ):
                    print(offset, bytecount)
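For instance, one could assemble the reference entries straight from those offsets and bytecounts. A rough sketch (covering only the full-resolution level, and omitting the .zarray/.zattrs metadata that a complete reference file also needs):

import json
import tifffile

refs = {}
for t, filename in enumerate(sorted(s3_flist)):
    with tifffile.TiffFile(fs.open(filename)) as tif:
        page = tif.series[0].levels[0].pages[0]
        # number of tiles per row at the full-resolution level
        tiles_across = (page.imagewidth + page.tilewidth - 1) // page.tilewidth
        for n, (offset, bytecount) in enumerate(
            zip(page.dataoffsets, page.databytecounts)
        ):
            y, x = divmod(n, tiles_across)
            # key: file index, tile row, tile column -> [url, offset, length]
            refs[f'{t}.{y}.{x}'] = [filename, offset, bytecount]

with open('refs.json', 'w') as fh:
    json.dump({'version': 1, 'refs': refs}, fh)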

@martindurant

I've had a go, and kerchunk's combine almost deals correctly with this case. It fails because there are no _ARRAY_DIMENSIONS, since the TIFFs don't conform to netCDF. If the output of TiffFile is meant for xarray (as opposed to just zarr), it would be reasonable to add them; but it's also reasonable that kerchunk should be able to cope with their absence.

Also as a note: there are very many small arrays in the data, so kerchunk wants to inline them, and it does this serially, which is very slow from where I am.

@cgohlke , I'd love to hear your thoughts on zarr sharding, versus the kinds of things that kerchunk can do, e.g., fsspec/kerchunk#134

@cgohlke (Owner) commented Apr 6, 2022

> It fails because there are no _ARRAY_DIMENSIONS

I tried to add zattrs with _ARRAY_DIMENSIONS to ZarrTiffStore, but it seems futile since xarray is able to load neither zarr arrays nor multiscale groups, the two use cases of ZarrTiffStore. Maybe I am missing something.

@martindurant

Xarray can read these data, if we assign independent X/Y coordinates to each downsampling level. Here is a hand-modified version of one of the COGs:

<xarray.Dataset>
Dimensions:  (x0: 105000, y0: 160000, x1: 52500, y1: 80000, x2: 26250, y2: 40000, x3: 13125, y3: 20000, x4: 6563, y4: 10000, x5: 3282, y5: 5000, x6: 1641, y6: 2500)
Dimensions without coordinates: x0, y0, x1, y1, x2, y2, x3, y3, x4, y4, x5, y5, x6, y6
Data variables:
    0        (x0, y0) float32 dask.array<chunksize=(2048, 2048), meta=np.ndarray>
    1        (x1, y1) float32 dask.array<chunksize=(128, 128), meta=np.ndarray>
    2        (x2, y2) float32 dask.array<chunksize=(128, 128), meta=np.ndarray>
    3        (x3, y3) float32 dask.array<chunksize=(128, 128), meta=np.ndarray>
    4        (x4, y4) float32 dask.array<chunksize=(128, 128), meta=np.ndarray>
    5        (x5, y5) float32 dask.array<chunksize=(128, 128), meta=np.ndarray>
    6        (x6, y6) float32 dask.array<chunksize=(128, 128), meta=np.ndarray>
Attributes:
    multiscales:  [{'datasets': [{'path': '0'}, {'path': '1'}, {'path': '2'},...

where the X/Ys are simply index arrays, but could be values taken to co-align the levels. Of course, one might ask whether it is useful to do this.

It is true that xarray does not handle zarr arrays, but of course kerchunk can produce arbitrary directory layouts, so making it into a group for xarray's consumption is easy.

Further, talking with zarr/OME people, it also makes sense to add an extra dimension to each of the levels, so they become 3D, even if the extra dimension is the same in each level. One would need to describe that the downsampling from one level to the next was not the same across dimensions (see the multiscales attribute above). It is an open question how many client libraries would know what to do with such a dataset.
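The hand-modification itself is mechanical. A sketch, assuming a reference JSON (here called refs.json) whose keys are laid out as level/...:

import json

with open('refs.json') as fh:
    data = json.load(fh)
refs = data.get('refs', data)

# give each pyramid level its own dimension names (y0/x0, y1/x1, ...)
# so xarray can open the multiscale group without dimension clashes
for key in list(refs):
    if key.endswith('/.zattrs'):
        level = key.split('/')[0]
        attrs = json.loads(refs[key])
        attrs['_ARRAY_DIMENSIONS'] = [f'y{level}', f'x{level}']
        refs[key] = json.dumps(attrs)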

@cgohlke (Owner) commented Apr 6, 2022

> Xarray can read these data, if we assign independent X/Y coordinates to each downsampling level

> It is true that xarray does not handle zarr arrays, but of course kerchunk can produce arbitrary directory layouts, so making it into a group for xarray's consumption is easy.

Indeed, including the level in the array dimension names is the workaround I just implemented in tifffile. Also, specifying a groupname for a zarr array in write_fsspec makes it consumable by xarray.

> Further, talking with zarr/OME people, it also does make sense to add an extra dimension to each of the levels, so they become 3D, even if the extra dimension is the same in each level.

That extra dimension would only be needed if there are not already dimensions that are not downsampled, like samples/channels or time?

@martindurant

> Further, talking with zarr/OME people, it also does make sense to add an extra dimension to each of the levels, so they become 3D, even if the extra dimension is the same in each level.

> That extra dimension would only be needed if there are not already dimensions that are not downsampled, like samples/channels or time?

Indeed, there might be even more dimensions like bands/channels. If the dimension being concatenated along also has multiple points per input file (e.g., each input array has 2 timepoints and so a shape like (2, X, Y)), then you get into exactly what kerchunk.combine is actually for; but we need a way to know what the times for each file are and where they figure in an array's dimensions, in order to know how to combine them. Of course, you can always do this kind of thing by processing the JSON directly, and time is just one example of a dimension to concatenate along; a sketch follows below.
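In kerchunk terms, that combine step looks roughly like this (the filenames and the attribute used for the time coordinate are illustrative, not from this thread):

from kerchunk.combine import MultiZarrToZarr

# combine per-file reference sets along a "time" dimension; coo_map tells
# kerchunk where each file's time value comes from
mzz = MultiZarrToZarr(
    ['file1.json', 'file2.json'],
    remote_protocol='s3',
    remote_options={'anon': True},
    concat_dims=['time'],
    coo_map={'time': 'attr:date'},  # hypothetical attribute holding the timestamp
)
refs = mzz.translate()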

@cgohlke (Owner) commented Apr 6, 2022

@martindurant : I just noticed that the earthbigdata.py script no longer works with fsspec 2022.3.0 or the main branch. The JSON file gets mapped and the xarray dataset created, but the array data and plots are empty. fsspec 2022.02.0 works with the same JSON file.

mapper = fsspec.get_mapper(
    'reference://',
    fo='earthbigdata.json',
    target_protocol='file',
    remote_protocol='http',
)
# %% [markdown]
"""
### Create a xarray dataset from the mapper
Use 'mask_and_scale' to disable conversion to floating point.
"""
# %%
dataset = xarray.open_dataset(
    mapper,
    engine='zarr',
    mask_and_scale=False,
    backend_kwargs={'consolidated': False},
)
print(dataset)
# %% [markdown]
"""
### Select the Southern California region in the dataset
"""
# %%
socal = dataset.sel(latitude=slice(36, 32.5), longitude=slice(-121, -115))
print(socal)
# %% [markdown]
"""
### Plot a selection of the dataset
The few GeoTIFF files comprising the selection are transparently downloaded,
decoded, and stitched to an in-memory numpy array and plotted using matplotlib.
"""
# %%
image = socal['COH'].loc['winter', 'vv', 12]
assert image[100, 100] == 53
xarray.plot.imshow(image, size=6, aspect=1.8)
matplotlib.pyplot.show()

Python 3.9.12 (tags/v3.9.12:b28265d, Mar 23 2022, 23:52:46) [MSC v.1929 64 bit (AMD64)]
numpy 1.21.5
matplotlib 3.4.3
tifffile 2022.3.25
imagecodecs 2022.2.22
numcodecs 0.9.1
fsspec 2022.3.0
xarray 2022.3.0
zarr 2.11.3

@martindurant

Because the reference for socal['COH'].loc['winter', 'vv', 12][100, 100], for instance, is ['https://sentinel-1-global-coherence-earthbigdata.s3.us-west-2.amazonaws.com/data/tiles/N36W121/N36W121_winter_vv_COH12.tif'], the remote protocol at

remote_protocol='http',

should be "https", not "http". The new version of ReferenceFileSystem allows you to have multiple filesystems to reference against, and the exact value of the protocol is used for matching (whereas before, both "http" and "https" gave you a filesystem that could cope with either).

@cgohlke (Owner) commented Apr 7, 2022

Thank you! It works.

@cgohlke (Owner) commented Apr 8, 2022

The file issue125.json was created by this script:

# tifffile/examples/issues125.py

"""Create a fsspec ReferenceFileSystem for a sequence of TIFF files on S3

This Python script uses the tifffile and fsspec libraries to create a
multiscale ReferenceFileSystem JSON file for a sequence of cloud optimized
GeoTIFF (COG) files stored on S3. The tiles of the COG files are used as
chunks. No additional numcodecs codec needs to be registered since the COG
files use zlib compression. An xarray dataset is created from the
ReferenceFileSystem file and a subset of the dataset is plotted.

See https://github.com/cgohlke/tifffile/issues/125

"""

import tifffile  # >= 2022.4.8
import fsspec
import xarray
from matplotlib import pyplot

# get a list of cloud optimized GeoTIFF files stored on S3
remote_options = {
    'anon': True,
    'client_kwargs': {'endpoint_url': 'https://mghp.osn.xsede.org'},
}
fs = fsspec.filesystem('s3', **remote_options)
files = [f's3://{f}' for f in fs.ls('/rsignellbucket1/lcmap/cog')]

# write the ReferenceFileSystem of each file to a JSON file
with open('issue125.json', 'w', newline='\n') as jsonfile:
    for i, filename in enumerate(tifffile.natural_sorted(files)):
        url, name = filename.rsplit('/', 1)
        with fs.open(filename) as fh:
            with tifffile.TiffFile(fh, name=name) as tif:
                with tif.series[0].aszarr() as store:
                    store.write_fsspec(
                        jsonfile,
                        url=url,
                        # using an experimental API:
                        _shape=[len(files)],  # shape of file sequence
                        _axes='T',  # axes of file sequence
                        _index=[i],  # index of this file in sequence
                        _append=i != 0,  # if True, only write index keys+value
                        _close=i == len(files) - 1,  # if True, no more appends
                        # groupname='0',  # required for non-pyramidal series
                    )

# create a fsspec mapper instance from the ReferenceFileSystem file
mapper = fsspec.get_mapper(
    'reference://',
    fo='issue125.json',
    target_protocol='file',
    remote_protocol='s3',
    remote_options=remote_options,
)

# create a xarray dataset from the mapper
dataset = xarray.open_zarr(mapper, consolidated=False)
print(dataset)

# plot a slice of the 5th pyramidal level of the dataset
xarray.plot.imshow(dataset['5'][0, 32:-32, 32:-32])
pyplot.show()

cgohlke closed this as completed on Apr 8, 2022
@rsignell-usgs (Author) commented Apr 8, 2022

@cgohlke I tried it and it worked perfectly -- it's awesome to be able to access the different overviews in xarray!
[screenshot]

What do you think would be the best way to generate the x and y coordinates for xarray?
These need to be calculated from start/dx/dy TIFFTAG metadata, right?

Xarray needs the full 1D x,y arrays, even if they are uniformly spaced (as they are in tiff).

These arrays would compress extremely well, however, so shouldn't take up too much space in the JSON, right?

@cgohlke (Owner) commented Apr 8, 2022

It should be possible to add the coordinate arrays to the JSON, similar to the earthbigdata.py script:

coordinates = {}
zarrgroup = zarr.open_group(coordinates)
zarrgroup.array(
    longitude_label, data=longitude_coordinates, dtype='float32'
).attrs['_ARRAY_DIMENSIONS'] = [longitude_label]

The coordinates could be hiding in the GeoTIFF metadata TiffFile.geotiff_metadata, e.g.:

{'GTCitationGeoKey': 'Albers',
 'GTModelTypeGeoKey': <ModelType.Projected: 1>,
 'GTRasterTypeGeoKey': <RasterPixel.IsArea: 1>,
 'GeogAngularUnitsGeoKey': <Angular.Degree: 9102>,
 'GeogCitationGeoKey': 'WGS 84',
 'GeogInvFlatteningGeoKey': 298.256999999996,
 'GeogSemiMajorAxisGeoKey': 6378140.0,
 'GeographicTypeGeoKey': <GCS.WGS_84: 4326>,
 'KeyDirectoryVersion': 1,
 'KeyRevision': 1,
 'KeyRevisionMinor': 0,
 'ModelPixelScale': [30.0, 30.0, 0.0],
 'ModelTiepoint': [0.0, 0.0, 0.0, -2415585.0, 3314805.0, 0.0],
 'ProjCoordTransGeoKey': <CT.AlbersEqualArea: 11>,
 'ProjFalseEastingGeoKey': 0.0,
 'ProjFalseNorthingGeoKey': 0.0,
 'ProjLinearUnitsGeoKey': <Linear.Meter: 9001>,
 'ProjNatOriginLatGeoKey': 23.0,
 'ProjNatOriginLongGeoKey': -96.0,
 'ProjStdParallel1GeoKey': 29.5,
 'ProjStdParallel2GeoKey': 45.5,
 'ProjectedCSTypeGeoKey': <PCS.User_Defined: 32767>,
 'ProjectionGeoKey': <Proj.User_Defined: 32767>}
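For example, ModelTiepoint and ModelPixelScale are enough to reconstruct the 1D coordinate arrays asked about above. A minimal sketch, assuming the usual north-up GeoTIFF convention (the tiepoint maps raster (i, j) to model (x, y), and y decreases as the row index grows):

import numpy as np

def pixel_coords(geotiff_metadata, shape):
    """Return (y, x) 1D coordinate arrays for an image of the given shape."""
    sx, sy, _ = geotiff_metadata['ModelPixelScale']
    i, j, _, x0, y0, _ = geotiff_metadata['ModelTiepoint']
    height, width = shape
    x = x0 + (np.arange(width) - i) * sx
    y = y0 - (np.arange(height) - j) * sy  # rows run north to south
    return y.astype('float32'), x.astype('float32')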

I just noticed that the files are color indexed (PhotometricInterpretation=PALETTE). Is that important for the ReferenceFileSystem?

@rsignell-usgs (Author)

I don't think so, but I'll check around. This is land cover classification with only 8 values:

  • Developed, Cropland, Grass/Shrub, Tree Cover, Wetland, Water, Snow/Ice and Barren.

@martindurant
Copy link

> The coordinates could be hiding in the GeoTIFF metadata

Shouldn't these tags appear as .zattrs in the output references? I don't know if they belong to the variables or the top level.

@RichardScottOZ

I just tried the above:-

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_37452/991707357.py in <module>
     14             with tifffile.TiffFile(fh, name=name) as tif:
     15                 with tif.series[0].aszarr() as store:
---> 16                     store.write_fsspec(
     17                         jsonfile,
     18                         url=url,

TypeError: write_fsspec() got an unexpected keyword argument '_shape'

@RichardScottOZ

am I behind?

@cgohlke (Owner) commented May 13, 2022

> am I behind?

import tifffile # >= 2022.4.8

@RichardScottOZ

thanks!

@RichardScottOZ

Having a look at your examples now.

@RichardScottOZ commented May 13, 2022

Borrowing the example is definitely doing some things

e.g. it starts, gets a list of chunks, then I think stops on the last file - am I missing something to do with adding a delimiter at the end?

{
 ".zattrs": "{\n \"_ARRAY_DIMENSIONS\": [\n  \"T\",\n  \"Y\",\n  \"X\"\n ]\n}",
 ".zarray": "{\n \"chunks\": [\n  1,\n  1,\n  5087\n ],\n \"compressor\": {\n  \"id\": \"imagecodecs_packbits\"\n },\n \"dtype\": \"<f4\",\n \"fill_value\": -1000.0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  117,\n  4710,\n  5087\n ],\n \"zarr_format\": 2\n}",

so that looks ok

I am just throwing it a list of single-band floating-point standard GeoTIFFs, not COGs

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_149996/2784632950.py in <module>
     19 
     20 # create a fsspec mapper instance from the ReferenceFileSystem file
---> 21 mapper = fsspec.get_mapper(
     22     'reference://',
     23     fo='test.json',

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\fsspec\mapping.py in get_mapper(url, check, create, missing_exceptions, alternate_root, **kwargs)
    226     """
    227     # Removing protocol here - could defer to each open() on the backend
--> 228     fs, urlpath = url_to_fs(url, **kwargs)
    229     root = alternate_root if alternate_root is not None else urlpath
    230     return FSMap(root, fs, check, create, missing_exceptions=missing_exceptions)

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\fsspec\core.py in url_to_fs(url, **kwargs)
    405         options = cls._get_kwargs_from_urls(url)
    406         update_storage_options(options, kwargs)
--> 407         fs = cls(**options)
    408         urlpath = fs._strip_protocol(url)
    409     return fs, urlpath

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\fsspec\spec.py in __call__(cls, *args, **kwargs)
     66             return cls._cache[token]
     67         else:
---> 68             obj = super().__call__(*args, **kwargs)
     69             # Setting _fs_token here causes some static linters to complain.
     70             obj._fs_token_ = token

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\fsspec\implementations\reference.py in __init__(self, fo, target, ref_storage_args, target_protocol, target_options, remote_protocol, remote_options, fs, template_overrides, simple_templates, loop, ref_type, **kwargs)
    149             self._process_dataframe()
    150         else:
--> 151             self._process_references(text, template_overrides)
    152         if fs is not None:
    153             self.fs = fs

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\fsspec\implementations\reference.py in _process_references(self, references, template_overrides)
    284     def _process_references(self, references, template_overrides=None):
    285         if isinstance(references, (str, bytes)):
--> 286             references = json.loads(references)
    287         vers = references.get("version", None)
    288         if vers is None:

ValueError: Unexpected character in found when decoding object value

@RichardScottOZ commented May 13, 2022

# write the ReferenceFileSystem of each file to a JSON file
with open('test.json', 'w', newline='\n') as jsonfile:
    for i, filename in enumerate(tifffile.natural_sorted(filetest[0:2])):
        url, name = filename.rsplit('/', 1)
        with fs.open(filename) as fh:
            with tifffile.TiffFile(fh, name=name) as tif:
                with tif.series[0].aszarr() as store:
                    store.write_fsspec(
                        jsonfile,
                        url=url,
                        # using an experimental API:
                        _shape=[len(files)],  # shape of file sequence
                        _axes='T',  # axes of file sequence
                        _index=[i],  # index of this file in sequence
                        _append=i != 0,  # if True, only write index keys+value
                        _close=i == len(files) - 1,  # if True, no more appends
                        # groupname='0',  # required for non-pyramidal series
                    )

# create a fsspec mapper instance from the ReferenceFileSystem file
mapper = fsspec.get_mapper(
    'reference://',
    fo='test.json',
    target_protocol='file',
    remote_protocol='s3',
    remote_options=r_opts,
)

@RichardScottOZ

e.g. there was no } at the end - how should the file close?
"1.4709.0": ["s3://testdata/test.tif", 96602874, 20507]

@RichardScottOZ

If I add the groupname in, then it wants a group - still no file closing, it seems

---------------------------------------------------------------------------
ContainsArrayError                        Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_149996/644436119.py in <module>
----> 1 dataset = xarray.open_zarr(mapper, consolidated=False)
      2 print(dataset)

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\zarr.py in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, storage_options, decode_timedelta, use_cftime, **kwargs)
    766     }
    767 
--> 768     ds = open_dataset(
    769         filename_or_obj=store,
    770         group=group,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    493 
    494     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495     backend_ds = backend.open_dataset(
    496         filename_or_obj,
    497         drop_variables=drop_variables,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\zarr.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel, lock)
    822 
    823         filename_or_obj = _normalize_path(filename_or_obj)
--> 824         store = ZarrStore.open_group(
    825             filename_or_obj,
    826             group=group,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\zarr.py in open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, append_dim, write_region, safe_chunks, stacklevel)
    384             zarr_group = zarr.open_consolidated(store, **open_kwargs)
    385         else:
--> 386             zarr_group = zarr.open_group(store, **open_kwargs)
    387         return cls(
    388             zarr_group,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in open_group(store, mode, cache_attrs, synchronizer, path, chunk_store, storage_options)
   1164     if mode in ['r', 'r+']:
   1165         if contains_array(store, path=path):
-> 1166             raise ContainsArrayError(path)
   1167         elif not contains_group(store, path=path):
   1168             raise GroupNotFoundError(path)

ContainsArrayError: path '' contains an array

@RichardScottOZ commented May 13, 2022

With the 0 default group, it then looks like this:-

{
 ".zgroup": "{\n \"zarr_format\": 2\n}",
 "0/.zattrs": "{\n \"_ARRAY_DIMENSIONS\": [\n  \"T\",\n  \"Y\",\n  \"X\"\n ]\n}",
 "0/.zarray": "{\n \"chunks\": [\n  1,\n  1,\n  5087\n ],\n \"compressor\": {\n  \"id\": \"imagecodecs_packbits\"\n },\n \"dtype\": \"<f4\",\n \"fill_value\": -1000.0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  117,\n  4710,\n  5087\n ],\n \"zarr_format\": 2\n}",

Same error if I actually close with a brace, too. Just trying with two files to start.

@RichardScottOZ

Unless this is actually an xarray problem? This environment has xarray 0.20.1.

@RichardScottOZ

The 125 example is fine though.
[image]

@RichardScottOZ

Something else it doesn't like in the directory - other JSON files, etc.?

@RichardScottOZ

Moving the notebook just to see, I get:-

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_26516/644436119.py in <module>
----> 1 dataset = xarray.open_zarr(mapper, consolidated=False)
      2 print(dataset)

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\zarr.py in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, storage_options, decode_timedelta, use_cftime, **kwargs)
    766     }
    767 
--> 768     ds = open_dataset(
    769         filename_or_obj=store,
    770         group=group,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    493 
    494     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495     backend_ds = backend.open_dataset(
    496         filename_or_obj,
    497         drop_variables=drop_variables,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\zarr.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel, lock)
    836         store_entrypoint = StoreBackendEntrypoint()
    837         with close_on_error(store):
--> 838             ds = store_entrypoint.open_dataset(
    839                 store,
    840                 mask_and_scale=mask_and_scale,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\store.py in open_dataset(self, store, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
     22         decode_timedelta=None,
     23     ):
---> 24         vars, attrs = store.load()
     25         encoding = store.get_encoding()
     26 

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\common.py in load(self)
    121         """
    122         variables = FrozenDict(
--> 123             (_decode_variable_name(k), v) for k, v in self.get_variables().items()
    124         )
    125         attributes = FrozenDict(self.get_attrs())

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\zarr.py in get_variables(self)
    436 
    437     def get_variables(self):
--> 438         return FrozenDict(
    439             (k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
    440         )

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\core\utils.py in FrozenDict(*args, **kwargs)
    474 
    475 def FrozenDict(*args, **kwargs) -> Frozen:
--> 476     return Frozen(dict(*args, **kwargs))
    477 
    478 

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\zarr.py in <genexpr>(.0)
    436 
    437     def get_variables(self):
--> 438         return FrozenDict(
    439             (k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
    440         )

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in _array_iter(self, keys_only, method, recurse)
    483             path = self._key_prefix + key
    484             if contains_array(self._store, path):
--> 485                 yield key if keys_only else (key, self[key])
    486             elif recurse and contains_group(self._store, path):
    487                 group = self[key]

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in __getitem__(self, item)
    339         path = self._item_path(item)
    340         if contains_array(self._store, path):
--> 341             return Array(self._store, read_only=self._read_only, path=path,
    342                          chunk_store=self._chunk_store,
    343                          synchronizer=self._synchronizer, cache_attrs=self.attrs.cache)

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\core.py in __init__(self, store, path, read_only, chunk_store, synchronizer, cache_metadata, cache_attrs, partial_decompress)
    157 
    158         # initialize metadata
--> 159         self._load_metadata()
    160 
    161         # initialize attributes

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\core.py in _load_metadata(self)
    174         """(Re)load metadata from store."""
    175         if self._synchronizer is None:
--> 176             self._load_metadata_nosync()
    177         else:
    178             mkey = self._key_prefix + array_meta_key

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\core.py in _load_metadata_nosync(self)
    203                 self._compressor = None
    204             else:
--> 205                 self._compressor = get_codec(config)
    206 
    207             # setup filters

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\numcodecs\registry.py in get_codec(config)
     31     cls = codec_registry.get(codec_id, None)
     32     if cls is None:
---> 33         raise ValueError('codec not available: %r' % codec_id)
     34     return cls.from_config(config)
     35 

ValueError: codec not available: 'imagecodecs_packbits'

fs = fsspec.filesystem('s3', **r_opts)

although the environment tells me it is there - maybe the environment has issues?

@RichardScottOZ

Same thing, different environment, so presumably some sort of install issue.

@RichardScottOZ

Just in case, but no luck

(base) C:\>conda activate kerchunktest

(kerchunktest) C:\>conda install -c conda-forge --force-reinstall imagecodecs
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\kerchunktest

  added / updated specs:
    - imagecodecs


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    blosc-1.21.1               |       h74325e0_2          49 KB  conda-forge
    c-blosc2-2.1.1             |       h09319c2_0         212 KB  conda-forge
    imagecodecs-2022.2.22      |   py38h6acac19_4         2.1 MB  conda-forge
    libavif-0.10.1             |       h8ffe710_0         2.3 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         4.6 MB

The following NEW packages will be INSTALLED:

  aom                conda-forge/win-64::aom-3.3.0-h0e60522_1
  blosc              conda-forge/win-64::blosc-1.21.1-h74325e0_2
  c-blosc2           conda-forge/win-64::c-blosc2-2.1.1-h09319c2_0
  cfitsio            conda-forge/win-64::cfitsio-4.1.0-h5a969a9_0
  charls             conda-forge/win-64::charls-2.3.4-h39d44d4_0
  giflib             conda-forge/win-64::giflib-5.2.1-h8d14728_2
  imagecodecs        conda-forge/win-64::imagecodecs-2022.2.22-py38h6acac19_4
  jbig               conda-forge/win-64::jbig-2.1-h8d14728_2003
  jxrlib             conda-forge/win-64::jxrlib-1.1-h8ffe710_2
  lcms2              conda-forge/win-64::lcms2-2.12-h2a16943_0
  lerc               conda-forge/win-64::lerc-3.0-h0e60522_0
  libaec             conda-forge/win-64::libaec-1.0.6-h39d44d4_0
  libavif            conda-forge/win-64::libavif-0.10.1-h8ffe710_0
  libbrotlicommon    conda-forge/win-64::libbrotlicommon-1.0.9-h8ffe710_7
  libbrotlidec       conda-forge/win-64::libbrotlidec-1.0.9-h8ffe710_7
  libbrotlienc       conda-forge/win-64::libbrotlienc-1.0.9-h8ffe710_7
  libdeflate         conda-forge/win-64::libdeflate-1.10-h8ffe710_0
  libtiff            conda-forge/win-64::libtiff-4.3.0-hc4061b1_3
  libwebp-base       conda-forge/win-64::libwebp-base-1.2.2-h8ffe710_1
  libzopfli          conda-forge/win-64::libzopfli-1.0.3-h0e60522_0
  openjpeg           conda-forge/win-64::openjpeg-2.4.0-hb211442_1
  snappy             conda-forge/win-64::snappy-1.1.9-h82413e6_0
  zfp                conda-forge/win-64::zfp-0.5.5-h0e60522_8


Proceed ([y]/n)? y


Downloading and Extracting Packages
c-blosc2-2.1.1       | 212 KB    | ############################################################################ | 100%
imagecodecs-2022.2.2 | 2.1 MB    | ############################################################################ | 100%
libavif-0.10.1       | 2.3 MB    | ############################################################################ | 100%
blosc-1.21.1         | 49 KB     | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

@martindurant

Unless you have also updated numcodecs, you need to do

import imagecodecs.numcodecs
imagecodecs.numcodecs.register_codecs()

to make sure zarr knows about it.

@RichardScottOZ

Thanks Martin! See how much further I can get tomorrow.

@cgohlke (Owner) commented May 13, 2022

> for i, filename in enumerate(tifffile.natural_sorted(filetest[0:2])):

> _close=i == len(files) - 1,  # if True, no more appends

> how should the file close?

Set _close to True only for the last file.
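Note that the quoted snippet iterates filetest[0:2] but tests len(files), so _close never becomes True and the closing brace is never written. A sketch of the consistent form, using the same names:

files = tifffile.natural_sorted(filetest[:2])
with open('test.json', 'w', newline='\n') as jsonfile:
    for i, filename in enumerate(files):
        url, name = filename.rsplit('/', 1)
        with fs.open(filename) as fh:
            with tifffile.TiffFile(fh, name=name) as tif:
                with tif.series[0].aszarr() as store:
                    store.write_fsspec(
                        jsonfile,
                        url=url,
                        _shape=[len(files)],  # same list as the loop iterates
                        _axes='T',
                        _index=[i],
                        _append=i != 0,
                        _close=i == len(files) - 1,  # True only on the last file
                    )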

"chunks\": [\n 1,\n 1,\n 5087\n ]

That's a very inefficient chunk. Probably better to treat individual files as chunks:

Read an image stack from a series of TIFF files with a file name pattern
as numpy or zarr arrays:

>>> image_sequence = TiffSequence('temp_C0*.tif', pattern=r'_(C)(\d+)(T)(\d+)')
>>> image_sequence.shape
(1, 2)
>>> image_sequence.axes
'CT'
>>> data = image_sequence.asarray()
>>> data.shape
(1, 2, 64, 64)
>>> with image_sequence.aszarr() as store:
...     zarr.open(store, mode='r')
<zarr.core.Array (1, 2, 64, 64) float64 read-only>
>>> image_sequence.close()

Write the zarr store to a fsspec ReferenceFileSystem in JSON format:

>>> with image_sequence.aszarr() as store:
...     store.write_fsspec('temp.json', url='file://')

Open the fsspec ReferenceFileSystem as a zarr array:

>>> import fsspec
>>> import tifffile.numcodecs
>>> tifffile.numcodecs.register_codec()
>>> mapper = fsspec.get_mapper(
...     'reference://', fo='temp.json', target_protocol='file')
>>> zarr.open(mapper, mode='r')
<zarr.core.Array (1, 2, 64, 64) float64 read-only>

In that case, using the TIFF decoder from the imagecodecs package is more efficient. See the earthbigdata.py example; a sketch follows below.
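A sketch of that whole-file-chunk route (the glob pattern is from the docstring above; chunkshape and dtype would come from reading one representative file, as in the earlier workaround):

import imagecodecs.numcodecs
import tifffile

# register imagecodecs' codecs (including 'imagecodecs_tiff') with numcodecs
imagecodecs.numcodecs.register_codecs()

# treat each TIFF file as one chunk and let libtiff decode it
with tifffile.TiffSequence('temp_C0*.tif').aszarr(
    chunkshape=chunkshape, dtype=dtype
) as store:
    store.write_fsspec('refs.json', url='file://', codec_id='imagecodecs_tiff')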

@RichardScottOZ

Thanks for that.

@RichardScottOZ commented May 14, 2022

OK, got this far

chunkshape = (testraster.shape[1],testraster.shape[2])
print("chunkshape:", chunkshape)
fillvalue = -1000

latitude_label = 'latitude'
latitude_coordinates = [y for y in testraster.y.values]
              
longitude_label = 'longitude'
longitude_coordinates = [x for x in testraster.x.values]

coordinates = {}
zarrgroup = zarr.open_group(coordinates)
zarrgroup.array(
    longitude_label, data=longitude_coordinates, dtype='float32'
).attrs['_ARRAY_DIMENSIONS'] = [longitude_label]

zarrgroup.array(
    latitude_label, data=latitude_coordinates, dtype='float32'
).attrs['_ARRAY_DIMENSIONS'] = [latitude_label]


coordinates_json = tifffile.ZarrStore._json(str(coordinates)).decode()

jsonfile.write(coordinates_json[:-2])  # skip the last newline and brace

#mode = 'test'  not needed here
fileseq = tifffile.TiffSequence(
    [file for file in pMine_files],
)


store = fileseq.aszarr(
    dtype='float32',
    chunkshape=chunkshape,
    fillvalue=fillvalue,
    axestiled={0: 0},
    zattrs={
        '_ARRAY_DIMENSIONS': [
            latitude_label,
            longitude_label,
        ]
    },
)
print(store)

store.write_fsspec(
    jsonfile,
    baseurl,
    groupname=mode,
    codec_id='imagecodecs_tiff',
    _append=True,
    _close=mode == 'lsmap',  # close after last store
)

jsonfile.write('\n}')
jsonfile.close()

chunkshape: (4710, 5087)
ZarrFileSequenceStore
 shape: 9420, 5087
 chunks: 4710, 5087
 dtype: float32
 fillvalue: -1000

@RichardScottOZ

Which seems good. There is a JSON formatting error now, which I bypassed with a string conversion in the interim to check that the above works.

However, returning coordinates_json = tifffile.ZarrStore._json(coordinates).decode()

gives a serialisation error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [70], in <cell line: 26>()
     17 zarrgroup.array(
     18     longitude_label, data=longitude_coordinates, dtype='float32'
     19 ).attrs['_ARRAY_DIMENSIONS'] = [longitude_label]
     21 zarrgroup.array(
     22     latitude_label, data=latitude_coordinates, dtype='float32'
     23 ).attrs['_ARRAY_DIMENSIONS'] = [latitude_label]
---> 26 coordinates_json = tifffile.ZarrStore._json(coordinates).decode()
     28 jsonfile.write(coordinates_json[:-2])  # skip the last newline and brace
     30 #mode = 'test'  not needed here

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\tifffile\tifffile.py:10256, in ZarrStore._json(obj)
  10253 @staticmethod
  10254 def _json(obj: Any, /) -> bytes:
  10255     """Serialize obj to a JSON formatted string."""
> 10256     return json.dumps(
  10257         obj,
  10258         indent=1,
  10259         sort_keys=True,
  10260         ensure_ascii=True,
  10261         separators=(',', ': '),
  10262     ).encode('ascii')

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\json\__init__.py:234, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    232 if cls is None:
    233     cls = JSONEncoder
--> 234 return cls(
    235     skipkeys=skipkeys, ensure_ascii=ensure_ascii,
    236     check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237     separators=separators, default=default, sort_keys=sort_keys,
    238     **kw).encode(obj)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\json\encoder.py:201, in JSONEncoder.encode(self, o)
    199 chunks = self.iterencode(o, _one_shot=True)
    200 if not isinstance(chunks, (list, tuple)):
--> 201     chunks = list(chunks)
    202 return ''.join(chunks)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\json\encoder.py:431, in _make_iterencode.<locals>._iterencode(o, _current_indent_level)
    429     yield from _iterencode_list(o, _current_indent_level)
    430 elif isinstance(o, dict):
--> 431     yield from _iterencode_dict(o, _current_indent_level)
    432 else:
    433     if markers is not None:

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\json\encoder.py:405, in _make_iterencode.<locals>._iterencode_dict(dct, _current_indent_level)
    403         else:
    404             chunks = _iterencode(value, _current_indent_level)
--> 405         yield from chunks
    406 if newline_indent is not None:
    407     _current_indent_level -= 1

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\json\encoder.py:438, in _make_iterencode.<locals>._iterencode(o, _current_indent_level)
    436         raise ValueError("Circular reference detected")
    437     markers[markerid] = o
--> 438 o = _default(o)
    439 yield from _iterencode(o, _current_indent_level)
    440 if markers is not None:

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\json\encoder.py:179, in JSONEncoder.default(self, o)
    160 def default(self, o):
    161     """Implement this method in a subclass such that it returns
    162     a serializable object for ``o``, or calls the base implementation
    163     (to raise a ``TypeError``).
   (...)
    177 
    178     """
--> 179     raise TypeError(f'Object of type {o.__class__.__name__} '
    180                     f'is not JSON serializable')

TypeError: Object of type bytes is not JSON serializable

@cgohlke (Owner) commented May 14, 2022

import base64

# base64 encode any values containing non-ascii characters
for k, v in coordinates.items():
    try:
        coordinates[k] = v.decode()
    except UnicodeDecodeError:
        coordinates[k] = 'base64:' + base64.b64encode(v).decode()
coordinates_json = tifffile.ZarrStore._json(coordinates).decode()

@RichardScottOZ

ah, thanks - missed that bit then, probably assuming it was ok.

@RichardScottOZ commented May 14, 2022

The testrasters look like this:
[image]

I get this error whether I include the band coordinate or not

".zgroup": "{\n    \"zarr_format\": 2\n}",
 "band/.zarray": "{\n    \"chunks\": [\n        1\n    ],\n    \"compressor\": {\n        \"blocksize\": 0,\n        \"clevel\": 5,\n        \"cname\": \"lz4\",\n        \"id\": \"blosc\",\n        \"shuffle\": 1\n    },\n    \"dtype\": \"<i4\",\n    \"fill_value\": 0,\n    \"filters\": null,\n    \"order\": \"C\",\n    \"shape\": [\n        1\n    ],\n    \"zarr_format\": 2\n}",
 "band/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"band\"\n    ]\n}",
 "band/0": "\u0002\u00013\u0004\u0004\u0000\u0000\u0000\u0004\u0000\u0000\u0000\u0014\u0000\u0000\u0000\u0001\u0000\u0000\u0000",
 "latitude/.zarray": "{\n    \"chunks\": [\n        4710\n    ],\n    \"compressor\": {\n        \"blocksize\": 0,\n        \"clevel\": 5,\n        \"cname\": \"lz4\",\n        \"id\": \"blosc\",\n        \"shuffle\": 1\n    },\n    \"dtype\": \"<f4\",\n    \"fill_value\": 0.0,\n    \"filters\": null,\n    \"order\": \"C\",\n    \"shape\": [\n        4710\n    ],\n    \"zarr_format\": 2\n}",
 "latitude/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"latitude\"\n    ]\n}",
 "latitude/0": 
...
"test/.zattrs": "{\n \"_ARRAY_DIMENSIONS\": [\n  \"band\",\n  \"latitude\",\n  \"longitude\"\n ]\n}",
 "test/.zarray": "{\n \"chunks\": [\n  1,\n  1,\n  4710,\n  5087\n ],\n \"compressor\": {\n  \"id\": \"imagecodecs_tiff\"\n },\n \"dtype\": \"<f4\",\n \"fill_value\": -1000.0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  2,\n  1,\n  4710,\n  5087\n ],\n \"zarr_format\": 2\n}",
 "test/0.0.0.0": ["https://generated/results-20201205/testrasters.tif"],
 "test/1.0.0.0": ["https://generated/results-20201206/testrasters.tif"]
}
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [127], in <cell line: 9>()
      1 mapper = fsspec.get_mapper(
      2     'reference://',
      3     fo='earthbigdata.json',
   (...)
      6     remote_options=r_opts,
      7 )
----> 9 dataset = xarray.open_dataset(
     10     mapper,
     11     engine='zarr',
     12     mask_and_scale=False,
     13     backend_kwargs={'consolidated': False},
     14 )
     15 print(dataset)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\api.py:495, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    483 decoders = _resolve_decoders_kwargs(
    484     decode_cf,
    485     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    491     decode_coords=decode_coords,
    492 )
    494 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495 backend_ds = backend.open_dataset(
    496     filename_or_obj,
    497     drop_variables=drop_variables,
    498     **decoders,
    499     **kwargs,
    500 )
    501 ds = _dataset_from_backend_dataset(
    502     backend_ds,
    503     filename_or_obj,
   (...)
    510     **kwargs,
    511 )
    512 return ds

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\zarr.py:814, in ZarrBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel)
    812 store_entrypoint = StoreBackendEntrypoint()
    813 with close_on_error(store):
--> 814     ds = store_entrypoint.open_dataset(
    815         store,
    816         mask_and_scale=mask_and_scale,
    817         decode_times=decode_times,
    818         concat_characters=concat_characters,
    819         decode_coords=decode_coords,
    820         drop_variables=drop_variables,
    821         use_cftime=use_cftime,
    822         decode_timedelta=decode_timedelta,
    823     )
    824 return ds

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\store.py:24, in StoreBackendEntrypoint.open_dataset(self, store, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
     12 def open_dataset(
     13     self,
     14     store,
   (...)
     22     decode_timedelta=None,
     23 ):
---> 24     vars, attrs = store.load()
     25     encoding = store.get_encoding()
     27     vars, attrs, coord_names = conventions.decode_cf_variables(
     28         vars,
     29         attrs,
   (...)
     36         decode_timedelta=decode_timedelta,
     37     )

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\common.py:123, in AbstractDataStore.load(self)
    101 def load(self):
    102     """
    103     This loads the variables and attributes simultaneously.
    104     A centralized loading function makes it easier to create
   (...)
    120     are requested, so care should be taken to make sure its fast.
    121     """
    122     variables = FrozenDict(
--> 123         (_decode_variable_name(k), v) for k, v in self.get_variables().items()
    124     )
    125     attributes = FrozenDict(self.get_attrs())
    126     return variables, attributes

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\zarr.py:422, in ZarrStore.get_variables(self)
    421 def get_variables(self):
--> 422     return FrozenDict(
    423         (k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
    424     )

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\utils.py:473, in FrozenDict(*args, **kwargs)
    472 def FrozenDict(*args, **kwargs) -> Frozen:
--> 473     return Frozen(dict(*args, **kwargs))

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\zarr.py:423, in <genexpr>(.0)
    421 def get_variables(self):
    422     return FrozenDict(
--> 423         (k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
    424     )

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\zarr.py:419, in ZarrStore.open_store_variable(self, name, zarr_array)
    416 if getattr(zarr_array, "fill_value") is not None:
    417     attributes["_FillValue"] = zarr_array.fill_value
--> 419 return Variable(dimensions, data, attributes, encoding)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\variable.py:305, in Variable.__init__(self, dims, data, attrs, encoding, fastpath)
    285 """
    286 Parameters
    287 ----------
   (...)
    302     unrecognized encoding items.
    303 """
    304 self._data = as_compatible_data(data, fastpath=fastpath)
--> 305 self._dims = self._parse_dimensions(dims)
    306 self._attrs = None
    307 self._encoding = None

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\variable.py:573, in Variable._parse_dimensions(self, dims)
    571 dims = tuple(dims)
    572 if len(dims) != self.ndim:
--> 573     raise ValueError(
    574         f"dimensions {dims} must have the same length as the "
    575         f"number of data dimensions, ndim={self.ndim}"
    576     )
    577 return dims

ValueError: dimensions ('latitude', 'longitude') must have the same length as the number of data dimensions, ndim=3

@RichardScottOZ

Ok, looks like it was keeping the first try - I'm trialling this in a notebook, so when I reran it was more up to date.

@RichardScottOZ

Trying with something useful:

[image]

So I presumably have something dimensionally wrong somewhere - or a protocol issue?

[image]

As it should look like Australia.

This is just plotting a copy of the first one of these I have locally:

[image]

The same via rioxarray directly from s3, with the same creds:

[image]

@RichardScottOZ

It seems I had to add another dimension to make it match and work, which isn't in Christoph's earthbigdata script, but I haven't worked out what I have done wrong yet. E.g. the first map looks like it is just all the fill value of -1000.

@RichardScottOZ commented May 14, 2022

Ok, looks like the autogenerated urls in the zarr are https; I changed my incorrect base url to s3 and all good! So it was the protocol.

[image]

@RichardScottOZ

So, those are small low-res files. How would you approach chunking generic GeoTIFFs this way? I had been converting things to zarr via xarray, which had useful defaults that worked ok.

@RichardScottOZ

> That's a very inefficient chunk. Probably better to treat individual files as chunks [...] In that case, using the TIFF decoder from the imagecodecs package is more efficient. See the earthbigdata.py example.

Have to decide what to do, chunk-wise, for a 40 GB GeoTIFF with 15 bands, for example.

@RichardScottOZ

> The 125 example is fine though. [image]

Although in this case I will need to add code afterwards to flip the map the right way round when plotting.

@RichardScottOZ

Or add coords?

@RichardScottOZ

So next question is approaching something like this:-

<xarray.Dataset>
Dimensions:    (band: 23, model_run: 1, y: 4710, x: 5087)
Coordinates:
  * band       (band) int32 1 2 3 4 5 6 7 8 9 10 ... 15 16 17 18 19 20 21 22 23
  * model_run  (model_run) int32 20201205
  * x          (x) float32 113.0 113.0 113.0 113.0 ... 154.0 154.0 154.0 154.0
  * y          (y) float32 -10.0 -10.01 -10.02 -10.03 ... -43.98 -43.99 -44.0
Data variables:
    pMine      (model_run, band, y, x) float32 ...

e.g. 23 different model varieties in one - this is too big for a 'one file' chunk. Depending how big they are, strip-sized chunks might be ok [if it is 50 GB], but probably not at the 2 GB level, e.g. one Sentinel band or the file example above.

@RichardScottOZ

Not a particularly useful thing to do, but a trial :-

chunkshape: (23, 4710, 5087)
['20201205']
ZarrFileSequenceStore
 shape: 1, 23, 4710, 5087
 chunks: 1, 23, 4710, 5087
 dtype: float32
 fillvalue: -1000

dataset['pMine'].max()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [11], in <cell line: 1>()
----> 1 dataset['pMine'].max()

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\common.py:58, in ImplementsArrayReduce._reduce_method.<locals>.wrapped_func(self, dim, axis, skipna, **kwargs)
     57 def wrapped_func(self, dim=None, axis=None, skipna=None, **kwargs):
---> 58     return self.reduce(func, dim, axis, skipna=skipna, **kwargs)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\dataarray.py:2696, in DataArray.reduce(self, func, dim, axis, keep_attrs, keepdims, **kwargs)
   2654 def reduce(
   2655     self,
   2656     func: Callable[..., Any],
   (...)
   2661     **kwargs: Any,
   2662 ) -> DataArray:
   2663     """Reduce this array by applying `func` along some dimension(s).
   2664 
   2665     Parameters
   (...)
   2693         summarized data and the indicated dimension(s) removed.
   2694     """
-> 2696     var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
   2697     return self._replace_maybe_drop_dims(var)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\variable.py:1806, in Variable.reduce(self, func, dim, axis, keep_attrs, keepdims, **kwargs)
   1804         data = func(self.data, axis=axis, **kwargs)
   1805     else:
-> 1806         data = func(self.data, **kwargs)
   1808 if getattr(data, "shape", ()) == self.shape:
   1809     dims = self.dims

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\variable.py:339, in Variable.data(self)
    337     return self._data
    338 else:
--> 339     return self.values

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\variable.py:512, in Variable.values(self)
    509 @property
    510 def values(self):
    511     """The variable's data as a numpy.ndarray"""
--> 512     return _as_array_or_item(self._data)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\variable.py:252, in _as_array_or_item(data)
    238 def _as_array_or_item(data):
    239     """Return the given values as a numpy array, or as an individual item if
    240     it's a 0d datetime64 or timedelta64 array.
    241 
   (...)
    250     TODO: remove this (replace with np.asarray) once these issues are fixed
    251     """
--> 252     data = np.asarray(data)
    253     if data.ndim == 0:
    254         if data.dtype.kind == "M":

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\indexing.py:552, in MemoryCachedArray.__array__(self, dtype)
    551 def __array__(self, dtype=None):
--> 552     self._ensure_cached()
    553     return np.asarray(self.array, dtype=dtype)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\indexing.py:549, in MemoryCachedArray._ensure_cached(self)
    547 def _ensure_cached(self):
    548     if not isinstance(self.array, NumpyIndexingAdapter):
--> 549         self.array = NumpyIndexingAdapter(np.asarray(self.array))

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\indexing.py:522, in CopyOnWriteArray.__array__(self, dtype)
    521 def __array__(self, dtype=None):
--> 522     return np.asarray(self.array, dtype=dtype)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\indexing.py:423, in LazilyIndexedArray.__array__(self, dtype)
    421 def __array__(self, dtype=None):
    422     array = as_indexable(self.array)
--> 423     return np.asarray(array[self.key], dtype=None)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\zarr.py:73, in ZarrArrayWrapper.__getitem__(self, key)
     71 array = self.get_array()
     72 if isinstance(key, indexing.BasicIndexer):
---> 73     return array[key.tuple]
     74 elif isinstance(key, indexing.VectorizedIndexer):
     75     return array.vindex[
     76         indexing._arrayize_vectorized_indexer(key, self.shape).tuple
     77     ]

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\zarr\core.py:720, in Array.__getitem__(self, selection)
    718     result = self.vindex[selection]
    719 else:
--> 720     result = self.get_basic_selection(pure_selection, fields=fields)
    721 return result

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\zarr\core.py:846, in Array.get_basic_selection(self, selection, out, fields)
    843     return self._get_basic_selection_zd(selection=selection, out=out,
    844                                         fields=fields)
    845 else:
--> 846     return self._get_basic_selection_nd(selection=selection, out=out,
    847                                         fields=fields)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\zarr\core.py:889, in Array._get_basic_selection_nd(self, selection, out, fields)
    883 def _get_basic_selection_nd(self, selection, out=None, fields=None):
    884     # implementation of basic selection for array with at least one dimension
    885 
    886     # setup indexer
    887     indexer = BasicIndexer(selection, self)
--> 889     return self._get_selection(indexer=indexer, out=out, fields=fields)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\zarr\core.py:1179, in Array._get_selection(self, indexer, out, fields)
   1173 if not hasattr(self.chunk_store, "getitems") or \
   1174    any(map(lambda x: x == 0, self.shape)):
   1175     # sequentially get one key at a time from storage
   1176     for chunk_coords, chunk_selection, out_selection in indexer:
   1177 
   1178         # load chunk selection into output array
-> 1179         self._chunk_getitem(chunk_coords, chunk_selection, out, out_selection,
   1180                             drop_axes=indexer.drop_axes, fields=fields)
   1181 else:
   1182     # allow storage to get multiple items at once
   1183     lchunk_coords, lchunk_selection, lout_selection = zip(*indexer)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\zarr\core.py:1883, in Array._chunk_getitem(self, chunk_coords, chunk_selection, out, out_selection, drop_axes, fields)
   1880         out[out_selection] = fill_value
   1882 else:
-> 1883     self._process_chunk(out, cdata, chunk_selection, drop_axes,
   1884                         out_is_ndarray, fields, out_selection)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\zarr\core.py:1791, in Array._process_chunk(self, out, cdata, chunk_selection, drop_axes, out_is_ndarray, fields, out_selection, partial_read_decode)
   1789     if isinstance(cdata, PartialReadBuffer):
   1790         cdata = cdata.read_full()
-> 1791     self._compressor.decode(cdata, dest)
   1792 else:
   1793     chunk = ensure_ndarray(cdata).view(self._dtype)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\imagecodecs\numcodecs.py:812, in Tiff.decode(self, buf, out)
    811 def decode(self, buf, out=None):
--> 812     return imagecodecs.tiff_decode(
    813         buf,
    814         index=self.index,
    815         asrgb=self.asrgb,
    816         verbose=self.verbose,
    817         out=out,
    818     )

File imagecodecs\_tiff.pyx:327, in imagecodecs._tiff.tiff_decode()

File imagecodecs\_shared.pyx:164, in imagecodecs._shared._create_array()

ValueError: invalid output shape (1, 23, 4710, 5087) (4710, 5087, 23)

@cgohlke (Owner) commented Jan 23, 2023

Looks like the dataset was removed:

>>> import fsspec
>>> fs = fsspec.filesystem('s3', anon=True, client_kwargs={'endpoint_url': 'https://mghp.osn.xsede.org'})
>>> fs.ls('/rsignellbucket1/lcmap/cog')
[]

Edit: Never mind. Works for me now.

@rsignell-usgs (Author)

@cgohlke, indeed, every once in a while it seems there is a glitch on the Open Storage Network. :(
