Creating a virtual zarr datacube from COG s3 objects #125

Closed
rsignell-usgs opened this issue Apr 5, 2022 · 53 comments
Labels: enhancement, question

@rsignell-usgs commented Apr 5, 2022

@cgohlke, as mentioned at fsspec/kerchunk#78 (comment), I'd like to try reading a time stack of geotiffs via the Zarr library.

As a test, I've got 2 COGs on S3-API-accessible object storage for which I'd like to create a virtual Zarr dataset using the tifffile library.

I can get it going if the COGs are local, but I'm a little unsure how to handle the s3 objects.

I'm trying to do:

from tifffile import TiffSequence
import zarr
import fsspec

fs = fsspec.filesystem('s3', anon=True, client_kwargs={'endpoint_url': 'https://mghp.osn.xsede.org'})

flist = fs.ls('/rsignellbucket1/lcmap/cog')

s3_flist = [f's3://{f}' for f in flist]

# try with s3 urls:
image_sequence = TiffSequence(s3_flist, pattern=r'.*_(\d+)_V12_LCPRI.tif')
with image_sequence.aszarr() as store:
    dz = zarr.open(store, mode='r')

# try with filelike objects:
file_like_object_list = [fs.open(f) for f in flist]
image_sequence = TiffSequence(file_like_object_list, pattern=r'.*_(\d+)_V12_LCPRI.tif')

but neither approach works. These buckets are public access (no egress charges!), so this code should be reproducible.

Here's the Jupyter Notebook demonstrating the local success (except I'd like to use the native 2048x2048 chunking in the COG instead of the entire file!) and remote s3 failures.

What is the best way to approach this?

rsignell-usgs changed the title from "Creating a virtual zarr datacube from s3 objects" to "Creating a virtual zarr datacube from COG s3 objects" on Apr 5, 2022
@martindurant

File ~/miniconda3/envs/pangeo/lib/python3.9/site-packages/tifffile/tifffile.py:10522, in FileHandle.open(self)
  10520 self._file = os.path.realpath(self._file)
  10521 self._dir, self._name = os.path.split(self._file)
> 10522 self._fh = open(self._file, self._mode)
  10523 self._close = True
  10524 if self._offset is None:

seems to be calling builtin open() explicitly, not allowing fsspec. I suppose you could open with simplecache:: to temporarily store local copies.
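For example, something like the following might work (an untested sketch; the object key here is made up):

import fsspec
import tifffile

# "simplecache::" makes fsspec download the object to a local temporary
# file first, sidestepping tifffile's use of the builtin open()
of = fsspec.open(
    'simplecache::s3://rsignellbucket1/lcmap/cog/example.tif',  # hypothetical key
    s3={'anon': True, 'client_kwargs': {'endpoint_url': 'https://mghp.osn.xsede.org'}},
)
with of.open() as fh, tifffile.TiffFile(fh) as tif:
    print(tif.series[0].shape)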

cgohlke added the enhancement and question labels on Apr 5, 2022
@cgohlke (Owner) commented Apr 5, 2022

To work around the immediate issue:

with TiffFile(fs.open(s3_flist[0])) as tif:
    shape = tif.series[0].shape
    dtype = tif.series[0].dtype

...

with image_sequence.aszarr(chunkshape=shape, dtype=dtype) as store:
    store.write_fsspec(
        'issue125.json',
        'https://mghp.osn.xsede.org/rsignellbucket1/lcmap/cog',
        codec_id='tifffile',
    )

Except for multi-file OME-TIFF series, tifffile does not currently support multiscales or chunks within file series. I am waiting for zarr to standardize "sharding" before implementing this.

Try to export the reference files for each TIFF file as follows and then edit+merge those files. Should be relatively straightforward for this case of a linear file sequence without higher dimensions or tiling:

import json
import fsspec
import zarr
import tifffile

fs = fsspec.filesystem(
    's3',
    anon=True,
    client_kwargs={'endpoint_url': 'https://mghp.osn.xsede.org'},
)
s3_flist = [f's3://{f}' for f in fs.ls('/rsignellbucket1/lcmap/cog')]

for filename in s3_flist:
    url, name = filename.rsplit('/', 1)
    with tifffile.TiffFile(fs.open(filename), name=name) as tif:
        with tif.series[0].aszarr() as store:
            store.write_fsspec(name + '.json', url=url)

# merge the per-file reference sets into a single one, prepending a new
# leading dimension of length len(s3_flist)
aggregate = {}
for i, filename in enumerate(sorted(s3_flist)):
    url, name = filename.rsplit('/', 1)
    with open(name + '.json') as fh:
        data = fh.read()
    for key, value in json.loads(data).items():
        if key in aggregate:
            continue
        if key.endswith('.zarray'):
            # extend the array metadata with the new file dimension
            value = json.loads(value)
            value['chunks'].insert(0, 1)
            value['shape'].insert(0, len(s3_flist))
            value = json.dumps(value)
        elif key[-1].isdigit():
            # prefix chunk keys with this file's index along the new axis
            key = key.replace('/', f'/{i}.')
        aggregate[key] = value

with open('aggregate.json', 'w') as fh:
    fh.write(json.dumps(aggregate, indent=1))

mapper = fsspec.get_mapper(
    'reference://',
    fo='aggregate.json',
    target_protocol='file',
    remote_protocol='s3',
)

za = zarr.open(mapper, mode='r')
print(za[0])

Alternatively, it might be easier to create the reference file directly from the metadata in the TIFF files without going through a zarr store. This prints all the necessary metadata:

for filename in sorted(s3_flist):
    with tifffile.TiffFile(fs.open(filename)) as tif:
        for level in tif.series[0].levels:
            print(level.dtype)
            print(level.shape)
            print(level.keyframe.chunks)
            print(level.keyframe.compression)
            for page in level.pages:
                for offset, bytecount in zip(
                    page.dataoffsets, page.databytecounts
                ):
                    print(offset, bytecount)
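For instance, one could assemble the reference entries straight from those offsets and bytecounts. A rough sketch (covering only the full-resolution level, and omitting the .zarray/.zattrs metadata that a complete reference file also needs):

import json
import tifffile

refs = {}
for t, filename in enumerate(sorted(s3_flist)):
    with tifffile.TiffFile(fs.open(filename)) as tif:
        page = tif.series[0].levels[0].pages[0]
        # number of tiles per row at the full-resolution level
        tiles_across = (page.imagewidth + page.tilewidth - 1) // page.tilewidth
        for n, (offset, bytecount) in enumerate(
            zip(page.dataoffsets, page.databytecounts)
        ):
            y, x = divmod(n, tiles_across)
            # key: file index, tile row, tile column -> [url, offset, length]
            refs[f'{t}.{y}.{x}'] = [filename, offset, bytecount]

with open('refs.json', 'w') as fh:
    json.dump({'version': 1, 'refs': refs}, fh)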

@martindurant

I've had a go, and kerchunk's combine almost deals correctly with this case. It fails because there are no _ARRAY_DIMENSIONS, since the TIFFs don't conform to netCDF. If the output of TiffFile is meant for xarray (as opposed to just zarr), it would be reasonable to add them; but it's also reasonable that kerchunk should be able to cope with their absence.

Also as a note: there are very many small arrays in the data, so kerchunk wants to inline them, and it does this serially, which is very slow from where I am.

@cgohlke , I'd love to hear your thoughts on zarr sharding, versus the kinds of things that kerchunk can do, e.g., fsspec/kerchunk#134

@cgohlke (Owner) commented Apr 6, 2022

> It fails because there are no _ARRAY_DIMENSIONS

I tried to add zattrs with _ARRAY_DIMENSIONS to ZarrTiffStore, but it seems futile since xarray is able to load neither zarr arrays nor multiscale groups, the two use cases of ZarrTiffStore. Maybe I am missing something.

@martindurant

Xarray can read these data, if we assign independent X/Y coordinates to each downsampling level. Here is a hand-modified version of one of the COGs:

<xarray.Dataset>
Dimensions:  (x0: 105000, y0: 160000, x1: 52500, y1: 80000, x2: 26250, y2: 40000, x3: 13125, y3: 20000, x4: 6563, y4: 10000, x5: 3282, y5: 5000, x6: 1641, y6: 2500)
Dimensions without coordinates: x0, y0, x1, y1, x2, y2, x3, y3, x4, y4, x5, y5, x6, y6
Data variables:
    0        (x0, y0) float32 dask.array<chunksize=(2048, 2048), meta=np.ndarray>
    1        (x1, y1) float32 dask.array<chunksize=(128, 128), meta=np.ndarray>
    2        (x2, y2) float32 dask.array<chunksize=(128, 128), meta=np.ndarray>
    3        (x3, y3) float32 dask.array<chunksize=(128, 128), meta=np.ndarray>
    4        (x4, y4) float32 dask.array<chunksize=(128, 128), meta=np.ndarray>
    5        (x5, y5) float32 dask.array<chunksize=(128, 128), meta=np.ndarray>
    6        (x6, y6) float32 dask.array<chunksize=(128, 128), meta=np.ndarray>
Attributes:
    multiscales:  [{'datasets': [{'path': '0'}, {'path': '1'}, {'path': '2'},...

where the X/Ys are simply index arrays, but could be values taken to co-align the levels. Of course, one might ask whether it is useful to do this.

It is true that xarray does not handle zarr arrays, but of course kerchunk can produce arbitrary directory layouts, so making it into a group for xarray's consumption is easy.

Further, talking with zarr/OME people, it also makes sense to add an extra dimension to each of the levels, so they become 3D, even if the extra dimension is the same in each level. One would need to describe that the downsampling from one level to the next was not the same across dimensions (see the multiscales attribute above). It is an open question how many client libraries would know what to do with such a dataset.
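The hand-modification itself is mechanical. A sketch, assuming a reference JSON (here called refs.json) whose keys are laid out as level/...:

import json

with open('refs.json') as fh:
    data = json.load(fh)
refs = data.get('refs', data)

# give each pyramid level its own dimension names (y0/x0, y1/x1, ...)
# so xarray can open the multiscale group without dimension clashes
for key in list(refs):
    if key.endswith('/.zattrs'):
        level = key.split('/')[0]
        attrs = json.loads(refs[key])
        attrs['_ARRAY_DIMENSIONS'] = [f'y{level}', f'x{level}']
        refs[key] = json.dumps(attrs)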

@cgohlke (Owner) commented Apr 6, 2022

> Xarray can read these data, if we assign independent X/Y coordinates to each downsampling level

> It is true that xarray does not handle zarr arrays, but of course kerchunk can produce arbitrary directory layouts, so making it into a group for xarray's consumption is easy.

Indeed, including the level in the array dimension names is the workaround I just implemented in tifffile. Also, specifying a groupname for a zarr array in write_fsspec makes it consumable by xarray.

> Further, talking with zarr/OME people, it also does make sense to add an extra dimension to each of the levels, so they become 3D, even if the extra dimension is the same in each level.

That extra dimension would only be needed if there are not already dimensions that are not downsampled, like samples/channels or time?

@martindurant

> Further, talking with zarr/OME people, it also does make sense to add an extra dimension to each of the levels, so they become 3D, even if the extra dimension is the same in each level.

> That extra dimension would only be needed if there are not already dimensions that are not downsampled, like samples/channels or time?

Indeed, there might be even more dimensions like bands/channels. If the dimension being concatenated along also has multiple points per input file (e.g., each input array has 2 timepoints and so a shape like (2, X, Y)), then you get into exactly what kerchunk.combine is actually for; but we need a way to know what the times for each file are and where they figure in an array's dimensions, in order to know how to combine them. Of course, you can always do this kind of thing by processing the JSON directly, and time is just one example of a dimension to concatenate along; a sketch follows below.
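In kerchunk terms, that combine step looks roughly like this (the filenames and the attribute used for the time coordinate are illustrative, not from this thread):

from kerchunk.combine import MultiZarrToZarr

# combine per-file reference sets along a "time" dimension; coo_map tells
# kerchunk where each file's time value comes from
mzz = MultiZarrToZarr(
    ['file1.json', 'file2.json'],
    remote_protocol='s3',
    remote_options={'anon': True},
    concat_dims=['time'],
    coo_map={'time': 'attr:date'},  # hypothetical attribute holding the timestamp
)
refs = mzz.translate()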

@cgohlke (Owner) commented Apr 6, 2022

@martindurant : I just noticed that the earthbigdata.py script no longer works with fsspec 2022.3.0 or the main branch. The JSON file gets mapped and the xarray dataset created, but the array data and plots are empty. fsspec 2022.02.0 works with the same JSON file.

mapper = fsspec.get_mapper(
    'reference://',
    fo='earthbigdata.json',
    target_protocol='file',
    remote_protocol='http',
)
# %% [markdown]
"""
### Create a xarray dataset from the mapper
Use 'mask_and_scale' to disable conversion to floating point.
"""
# %%
dataset = xarray.open_dataset(
    mapper,
    engine='zarr',
    mask_and_scale=False,
    backend_kwargs={'consolidated': False},
)
print(dataset)
# %% [markdown]
"""
### Select the Southern California region in the dataset
"""
# %%
socal = dataset.sel(latitude=slice(36, 32.5), longitude=slice(-121, -115))
print(socal)
# %% [markdown]
"""
### Plot a selection of the dataset
The few GeoTIFF files comprising the selection are transparently downloaded,
decoded, and stitched to an in-memory numpy array and plotted using matplotlib.
"""
# %%
image = socal['COH'].loc['winter', 'vv', 12]
assert image[100, 100] == 53
xarray.plot.imshow(image, size=6, aspect=1.8)
matplotlib.pyplot.show()

Python 3.9.12 (tags/v3.9.12:b28265d, Mar 23 2022, 23:52:46) [MSC v.1929 64 bit (AMD64)]
numpy 1.21.5
matplotlib 3.4.3
tifffile 2022.3.25
imagecodecs 2022.2.22
numcodecs 0.9.1
fsspec 2022.3.0
xarray 2022.3.0
zarr 2.11.3

@martindurant

Because the reference for socal['COH'].loc['winter', 'vv', 12][100, 100], for instance, is ['https://sentinel-1-global-coherence-earthbigdata.s3.us-west-2.amazonaws.com/data/tiles/N36W121/N36W121_winter_vv_COH12.tif'], the remote protocol at

remote_protocol='http',

should be "https", not "http". The new version of ReferenceFileSystem allows you to have multiple filesystems to reference against, and the exact value of the protocol is used for matching (whereas before, both "http" and "https" gave you a filesystem that could cope with either).

@cgohlke (Owner) commented Apr 7, 2022

Thank you! It works.

@cgohlke (Owner) commented Apr 8, 2022

The file issue125.json was created by this script:

# tifffile/examples/issues125.py

"""Create a fsspec ReferenceFileSystem for a sequence of TIFF files on S3

This Python script uses the tifffile and fsspec libraries to create a
multiscale ReferenceFileSystem JSON file for a sequence of cloud optimized
GeoTIFF (COG) files stored on S3. The tiles of the COG files are used as
chunks. No additional numcodecs codec needs to be registered since the COG
files use zlib compression. An xarray dataset is created from the
ReferenceFileSystem file and a subset of the dataset is plotted.

See https://github.com/cgohlke/tifffile/issues/125

"""

import tifffile  # >= 2022.4.8
import fsspec
import xarray
from matplotlib import pyplot

# get a list of cloud optimized GeoTIFF files stored on S3
remote_options = {
    'anon': True,
    'client_kwargs': {'endpoint_url': 'https://mghp.osn.xsede.org'},
}
fs = fsspec.filesystem('s3', **remote_options)
files = [f's3://{f}' for f in fs.ls('/rsignellbucket1/lcmap/cog')]

# write the ReferenceFileSystem of each file to a JSON file
with open('issue125.json', 'w', newline='\n') as jsonfile:
    for i, filename in enumerate(tifffile.natural_sorted(files)):
        url, name = filename.rsplit('/', 1)
        with fs.open(filename) as fh:
            with tifffile.TiffFile(fh, name=name) as tif:
                with tif.series[0].aszarr() as store:
                    store.write_fsspec(
                        jsonfile,
                        url=url,
                        # using an experimental API:
                        _shape=[len(files)],  # shape of file sequence
                        _axes='T',  # axes of file sequence
                        _index=[i],  # index of this file in sequence
                        _append=i != 0,  # if True, only write index keys+value
                        _close=i == len(files) - 1,  # if True, no more appends
                        # groupname='0',  # required for non-pyramidal series
                    )

# create a fsspec mapper instance from the ReferenceFileSystem file
mapper = fsspec.get_mapper(
    'reference://',
    fo='issue125.json',
    target_protocol='file',
    remote_protocol='s3',
    remote_options=remote_options,
)

# create a xarray dataset from the mapper
dataset = xarray.open_zarr(mapper, consolidated=False)
print(dataset)

# plot a slice of the 5th pyramidal level of the dataset
xarray.plot.imshow(dataset['5'][0, 32:-32, 32:-32])
pyplot.show()

cgohlke closed this as completed on Apr 8, 2022
@rsignell-usgs (Author) commented Apr 8, 2022

@cgohlke I tried it and it worked perfectly -- it's awesome to be able to access the different overviews in xarray!
[screenshot]

What do you think would be the best way to generate the x and y coordinates for xarray?
These need to be calculated from start/dx/dy TIFFTAG metadata, right?

Xarray needs the full 1D x,y arrays, even if they are uniformly spaced (as they are in tiff).

These arrays would compress extremely well, however, so shouldn't take up too much space in the JSON, right?

@cgohlke (Owner) commented Apr 8, 2022

It should be possible to add the coordinate arrays to the JSON, similar to the earthbigdata.py script:

coordinates = {}
zarrgroup = zarr.open_group(coordinates)
zarrgroup.array(
    longitude_label, data=longitude_coordinates, dtype='float32'
).attrs['_ARRAY_DIMENSIONS'] = [longitude_label]

The coordinates could be hiding in the GeoTIFF metadata TiffFile.geotiff_metadata, e.g.:

{'GTCitationGeoKey': 'Albers',
 'GTModelTypeGeoKey': <ModelType.Projected: 1>,
 'GTRasterTypeGeoKey': <RasterPixel.IsArea: 1>,
 'GeogAngularUnitsGeoKey': <Angular.Degree: 9102>,
 'GeogCitationGeoKey': 'WGS 84',
 'GeogInvFlatteningGeoKey': 298.256999999996,
 'GeogSemiMajorAxisGeoKey': 6378140.0,
 'GeographicTypeGeoKey': <GCS.WGS_84: 4326>,
 'KeyDirectoryVersion': 1,
 'KeyRevision': 1,
 'KeyRevisionMinor': 0,
 'ModelPixelScale': [30.0, 30.0, 0.0],
 'ModelTiepoint': [0.0, 0.0, 0.0, -2415585.0, 3314805.0, 0.0],
 'ProjCoordTransGeoKey': <CT.AlbersEqualArea: 11>,
 'ProjFalseEastingGeoKey': 0.0,
 'ProjFalseNorthingGeoKey': 0.0,
 'ProjLinearUnitsGeoKey': <Linear.Meter: 9001>,
 'ProjNatOriginLatGeoKey': 23.0,
 'ProjNatOriginLongGeoKey': -96.0,
 'ProjStdParallel1GeoKey': 29.5,
 'ProjStdParallel2GeoKey': 45.5,
 'ProjectedCSTypeGeoKey': <PCS.User_Defined: 32767>,
 'ProjectionGeoKey': <Proj.User_Defined: 32767>}
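For example, ModelTiepoint and ModelPixelScale are enough to reconstruct the 1D coordinate arrays asked about above. A minimal sketch, assuming the usual north-up GeoTIFF convention (the tiepoint maps raster (i, j) to model (x, y), and y decreases as the row index grows):

import numpy as np

def pixel_coords(geotiff_metadata, shape):
    """Return (y, x) 1D coordinate arrays for an image of the given shape."""
    sx, sy, _ = geotiff_metadata['ModelPixelScale']
    i, j, _, x0, y0, _ = geotiff_metadata['ModelTiepoint']
    height, width = shape
    x = x0 + (np.arange(width) - i) * sx
    y = y0 - (np.arange(height) - j) * sy  # rows run north to south
    return y.astype('float32'), x.astype('float32')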

I just noticed that the files are color indexed (PhotometricInterpretation=PALETTE). Is that important for the ReferenceFileSystem?

@rsignell-usgs (Author)

I don't think so, but I'll check around. This is land cover classification with only 8 values:

  • Developed, Cropland, Grass/Shrub, Tree Cover, Wetland, Water, Snow/Ice and Barren.

@martindurant
Copy link

> The coordinates could be hiding in the GeoTIFF metadata

Shouldn't these tags appear as .zattrs in the output references? I don't know if they belong to the variables or the top level.

@RichardScottOZ

I just tried the above:-

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_37452/991707357.py in <module>
     14             with tifffile.TiffFile(fh, name=name) as tif:
     15                 with tif.series[0].aszarr() as store:
---> 16                     store.write_fsspec(
     17                         jsonfile,
     18                         url=url,

TypeError: write_fsspec() got an unexpected keyword argument '_shape'

@RichardScottOZ

am I behind?

@cgohlke (Owner) commented May 13, 2022

> am I behind?

import tifffile # >= 2022.4.8

@RichardScottOZ

thanks!

@RichardScottOZ

Having a look at your examples now.

@RichardScottOZ commented May 13, 2022

Borrowing the example is definitely doing some things

e.g. it starts, gets a list of chunks, then I think stops on the last file - am I missing something to do with adding a delimiter at the end?

{
 ".zattrs": "{\n \"_ARRAY_DIMENSIONS\": [\n  \"T\",\n  \"Y\",\n  \"X\"\n ]\n}",
 ".zarray": "{\n \"chunks\": [\n  1,\n  1,\n  5087\n ],\n \"compressor\": {\n  \"id\": \"imagecodecs_packbits\"\n },\n \"dtype\": \"<f4\",\n \"fill_value\": -1000.0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  117,\n  4710,\n  5087\n ],\n \"zarr_format\": 2\n}",

so that looks ok

I am just throwing it a list of single-band floating-point standard GeoTIFFs, not COGs

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_149996/2784632950.py in <module>
     19 
     20 # create a fsspec mapper instance from the ReferenceFileSystem file
---> 21 mapper = fsspec.get_mapper(
     22     'reference://',
     23     fo='test.json',

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\fsspec\mapping.py in get_mapper(url, check, create, missing_exceptions, alternate_root, **kwargs)
    226     """
    227     # Removing protocol here - could defer to each open() on the backend
--> 228     fs, urlpath = url_to_fs(url, **kwargs)
    229     root = alternate_root if alternate_root is not None else urlpath
    230     return FSMap(root, fs, check, create, missing_exceptions=missing_exceptions)

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\fsspec\core.py in url_to_fs(url, **kwargs)
    405         options = cls._get_kwargs_from_urls(url)
    406         update_storage_options(options, kwargs)
--> 407         fs = cls(**options)
    408         urlpath = fs._strip_protocol(url)
    409     return fs, urlpath

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\fsspec\spec.py in __call__(cls, *args, **kwargs)
     66             return cls._cache[token]
     67         else:
---> 68             obj = super().__call__(*args, **kwargs)
     69             # Setting _fs_token here causes some static linters to complain.
     70             obj._fs_token_ = token

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\fsspec\implementations\reference.py in __init__(self, fo, target, ref_storage_args, target_protocol, target_options, remote_protocol, remote_options, fs, template_overrides, simple_templates, loop, ref_type, **kwargs)
    149             self._process_dataframe()
    150         else:
--> 151             self._process_references(text, template_overrides)
    152         if fs is not None:
    153             self.fs = fs

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\fsspec\implementations\reference.py in _process_references(self, references, template_overrides)
    284     def _process_references(self, references, template_overrides=None):
    285         if isinstance(references, (str, bytes)):
--> 286             references = json.loads(references)
    287         vers = references.get("version", None)
    288         if vers is None:

ValueError: Unexpected character in found when decoding object value

@RichardScottOZ commented May 13, 2022

# write the ReferenceFileSystem of each file to a JSON file
with open('test.json', 'w', newline='\n') as jsonfile:
    for i, filename in enumerate(tifffile.natural_sorted(filetest[0:2])):
        url, name = filename.rsplit('/', 1)
        with fs.open(filename) as fh:
            with tifffile.TiffFile(fh, name=name) as tif:
                with tif.series[0].aszarr() as store:
                    store.write_fsspec(
                        jsonfile,
                        url=url,
                        # using an experimental API:
                        _shape=[len(files)],  # shape of file sequence
                        _axes='T',  # axes of file sequence
                        _index=[i],  # index of this file in sequence
                        _append=i != 0,  # if True, only write index keys+value
                        _close=i == len(files) - 1,  # if True, no more appends
                        # groupname='0',  # required for non-pyramidal series
                    )

# create a fsspec mapper instance from the ReferenceFileSystem file
mapper = fsspec.get_mapper(
    'reference://',
    fo='test.json',
    target_protocol='file',
    remote_protocol='s3',
    remote_options=r_opts,
)

@RichardScottOZ

e.g. there was no } at the end - how should the file close?
"1.4709.0": ["s3://testdata/test.tif", 96602874, 20507]

@RichardScottOZ

If I add the groupname in, then it wants a group - still no file closing, it seems

---------------------------------------------------------------------------
ContainsArrayError                        Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_149996/644436119.py in <module>
----> 1 dataset = xarray.open_zarr(mapper, consolidated=False)
      2 print(dataset)

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\zarr.py in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, storage_options, decode_timedelta, use_cftime, **kwargs)
    766     }
    767 
--> 768     ds = open_dataset(
    769         filename_or_obj=store,
    770         group=group,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    493 
    494     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495     backend_ds = backend.open_dataset(
    496         filename_or_obj,
    497         drop_variables=drop_variables,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\zarr.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel, lock)
    822 
    823         filename_or_obj = _normalize_path(filename_or_obj)
--> 824         store = ZarrStore.open_group(
    825             filename_or_obj,
    826             group=group,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\zarr.py in open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, append_dim, write_region, safe_chunks, stacklevel)
    384             zarr_group = zarr.open_consolidated(store, **open_kwargs)
    385         else:
--> 386             zarr_group = zarr.open_group(store, **open_kwargs)
    387         return cls(
    388             zarr_group,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in open_group(store, mode, cache_attrs, synchronizer, path, chunk_store, storage_options)
   1164     if mode in ['r', 'r+']:
   1165         if contains_array(store, path=path):
-> 1166             raise ContainsArrayError(path)
   1167         elif not contains_group(store, path=path):
   1168             raise GroupNotFoundError(path)

ContainsArrayError: path '' contains an array

@RichardScottOZ commented May 13, 2022

With the 0 default group, it then looks like this:-

{
 ".zgroup": "{\n \"zarr_format\": 2\n}",
 "0/.zattrs": "{\n \"_ARRAY_DIMENSIONS\": [\n  \"T\",\n  \"Y\",\n  \"X\"\n ]\n}",
 "0/.zarray": "{\n \"chunks\": [\n  1,\n  1,\n  5087\n ],\n \"compressor\": {\n  \"id\": \"imagecodecs_packbits\"\n },\n \"dtype\": \"<f4\",\n \"fill_value\": -1000.0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  117,\n  4710,\n  5087\n ],\n \"zarr_format\": 2\n}",

Same error if I actually close with a brace, too. Just trying with two files to start.

@RichardScottOZ

Unless this is actually an xarray problem? This environment has xarray 0.20.1.

@RichardScottOZ

The 125 example is fine though.
[image]

@RichardScottOZ

Something else it doesn't like in the directory - other JSON files, etc.?

@RichardScottOZ

Moving the notebook just to see, I get:-

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_26516/644436119.py in <module>
----> 1 dataset = xarray.open_zarr(mapper, consolidated=False)
      2 print(dataset)

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\zarr.py in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, storage_options, decode_timedelta, use_cftime, **kwargs)
    766     }
    767 
--> 768     ds = open_dataset(
    769         filename_or_obj=store,
    770         group=group,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    493 
    494     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495     backend_ds = backend.open_dataset(
    496         filename_or_obj,
    497         drop_variables=drop_variables,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\zarr.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel, lock)
    836         store_entrypoint = StoreBackendEntrypoint()
    837         with close_on_error(store):
--> 838             ds = store_entrypoint.open_dataset(
    839                 store,
    840                 mask_and_scale=mask_and_scale,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\store.py in open_dataset(self, store, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
     22         decode_timedelta=None,
     23     ):
---> 24         vars, attrs = store.load()
     25         encoding = store.get_encoding()
     26 

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\common.py in load(self)
    121         """
    122         variables = FrozenDict(
--> 123             (_decode_variable_name(k), v) for k, v in self.get_variables().items()
    124         )
    125         attributes = FrozenDict(self.get_attrs())

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\zarr.py in get_variables(self)
    436 
    437     def get_variables(self):
--> 438         return FrozenDict(
    439             (k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
    440         )

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\core\utils.py in FrozenDict(*args, **kwargs)
    474 
    475 def FrozenDict(*args, **kwargs) -> Frozen:
--> 476     return Frozen(dict(*args, **kwargs))
    477 
    478 

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\xarray\backends\zarr.py in <genexpr>(.0)
    436 
    437     def get_variables(self):
--> 438         return FrozenDict(
    439             (k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
    440         )

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in _array_iter(self, keys_only, method, recurse)
    483             path = self._key_prefix + key
    484             if contains_array(self._store, path):
--> 485                 yield key if keys_only else (key, self[key])
    486             elif recurse and contains_group(self._store, path):
    487                 group = self[key]

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in __getitem__(self, item)
    339         path = self._item_path(item)
    340         if contains_array(self._store, path):
--> 341             return Array(self._store, read_only=self._read_only, path=path,
    342                          chunk_store=self._chunk_store,
    343                          synchronizer=self._synchronizer, cache_attrs=self.attrs.cache)

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\core.py in __init__(self, store, path, read_only, chunk_store, synchronizer, cache_metadata, cache_attrs, partial_decompress)
    157 
    158         # initialize metadata
--> 159         self._load_metadata()
    160 
    161         # initialize attributes

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\core.py in _load_metadata(self)
    174         """(Re)load metadata from store."""
    175         if self._synchronizer is None:
--> 176             self._load_metadata_nosync()
    177         else:
    178             mkey = self._key_prefix + array_meta_key

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\core.py in _load_metadata_nosync(self)
    203                 self._compressor = None
    204             else:
--> 205                 self._compressor = get_codec(config)
    206 
    207             # setup filters

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\numcodecs\registry.py in get_codec(config)
     31     cls = codec_registry.get(codec_id, None)
     32     if cls is None:
---> 33         raise ValueError('codec not available: %r' % codec_id)
     34     return cls.from_config(config)
     35 

ValueError: codec not available: 'imagecodecs_packbits'

fs = fsspec.filesystem('s3', **r_opts)

although the environment tells me it is there - maybe the environment has issues?

@RichardScottOZ

Same thing, different environment, so presumably some sort of install issue.

@RichardScottOZ

Just in case, but no luck

(base) C:\>conda activate kerchunktest

(kerchunktest) C:\>conda install -c conda-forge --force-reinstall imagecodecs
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\kerchunktest

  added / updated specs:
    - imagecodecs


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    blosc-1.21.1               |       h74325e0_2          49 KB  conda-forge
    c-blosc2-2.1.1             |       h09319c2_0         212 KB  conda-forge
    imagecodecs-2022.2.22      |   py38h6acac19_4         2.1 MB  conda-forge
    libavif-0.10.1             |       h8ffe710_0         2.3 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         4.6 MB

The following NEW packages will be INSTALLED:

  aom                conda-forge/win-64::aom-3.3.0-h0e60522_1
  blosc              conda-forge/win-64::blosc-1.21.1-h74325e0_2
  c-blosc2           conda-forge/win-64::c-blosc2-2.1.1-h09319c2_0
  cfitsio            conda-forge/win-64::cfitsio-4.1.0-h5a969a9_0
  charls             conda-forge/win-64::charls-2.3.4-h39d44d4_0
  giflib             conda-forge/win-64::giflib-5.2.1-h8d14728_2
  imagecodecs        conda-forge/win-64::imagecodecs-2022.2.22-py38h6acac19_4
  jbig               conda-forge/win-64::jbig-2.1-h8d14728_2003
  jxrlib             conda-forge/win-64::jxrlib-1.1-h8ffe710_2
  lcms2              conda-forge/win-64::lcms2-2.12-h2a16943_0
  lerc               conda-forge/win-64::lerc-3.0-h0e60522_0
  libaec             conda-forge/win-64::libaec-1.0.6-h39d44d4_0
  libavif            conda-forge/win-64::libavif-0.10.1-h8ffe710_0
  libbrotlicommon    conda-forge/win-64::libbrotlicommon-1.0.9-h8ffe710_7
  libbrotlidec       conda-forge/win-64::libbrotlidec-1.0.9-h8ffe710_7
  libbrotlienc       conda-forge/win-64::libbrotlienc-1.0.9-h8ffe710_7
  libdeflate         conda-forge/win-64::libdeflate-1.10-h8ffe710_0
  libtiff            conda-forge/win-64::libtiff-4.3.0-hc4061b1_3
  libwebp-base       conda-forge/win-64::libwebp-base-1.2.2-h8ffe710_1
  libzopfli          conda-forge/win-64::libzopfli-1.0.3-h0e60522_0
  openjpeg           conda-forge/win-64::openjpeg-2.4.0-hb211442_1
  snappy             conda-forge/win-64::snappy-1.1.9-h82413e6_0
  zfp                conda-forge/win-64::zfp-0.5.5-h0e60522_8


Proceed ([y]/n)? y


Downloading and Extracting Packages
c-blosc2-2.1.1       | 212 KB    | ############################################################################ | 100%
imagecodecs-2022.2.2 | 2.1 MB    | ############################################################################ | 100%
libavif-0.10.1       | 2.3 MB    | ############################################################################ | 100%
blosc-1.21.1         | 49 KB     | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

@martindurant

Unless you have also updated numcodecs, you need to do

import imagecodecs.numcodecs
imagecodecs.numcodecs.register_codecs()

to make sure zarr knows about it.

@RichardScottOZ

Thanks Martin! See how much further I can get tomorrow.

@cgohlke (Owner) commented May 13, 2022

> for i, filename in enumerate(tifffile.natural_sorted(filetest[0:2])):

> _close=i == len(files) - 1,  # if True, no more appends

> how should the file close?

Set _close to True only for the last file.
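Note that the quoted snippet iterates filetest[0:2] but tests len(files), so _close never becomes True and the closing brace is never written. A sketch of the consistent form, using the same names:

files = tifffile.natural_sorted(filetest[:2])
with open('test.json', 'w', newline='\n') as jsonfile:
    for i, filename in enumerate(files):
        url, name = filename.rsplit('/', 1)
        with fs.open(filename) as fh:
            with tifffile.TiffFile(fh, name=name) as tif:
                with tif.series[0].aszarr() as store:
                    store.write_fsspec(
                        jsonfile,
                        url=url,
                        _shape=[len(files)],  # same list as the loop iterates
                        _axes='T',
                        _index=[i],
                        _append=i != 0,
                        _close=i == len(files) - 1,  # True only on the last file
                    )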

"chunks\": [\n 1,\n 1,\n 5087\n ]

That's a very inefficient chunk. Probably better to treat individual files as chunks:

Read an image stack from a series of TIFF files with a file name pattern
as numpy or zarr arrays:

>>> image_sequence = TiffSequence('temp_C0*.tif', pattern=r'_(C)(\d+)(T)(\d+)')
>>> image_sequence.shape
(1, 2)
>>> image_sequence.axes
'CT'
>>> data = image_sequence.asarray()
>>> data.shape
(1, 2, 64, 64)
>>> with image_sequence.aszarr() as store:
...     zarr.open(store, mode='r')
<zarr.core.Array (1, 2, 64, 64) float64 read-only>
>>> image_sequence.close()

Write the zarr store to a fsspec ReferenceFileSystem in JSON format:

>>> with image_sequence.aszarr() as store:
...     store.write_fsspec('temp.json', url='file://')

Open the fsspec ReferenceFileSystem as a zarr array:

>>> import fsspec
>>> import tifffile.numcodecs
>>> tifffile.numcodecs.register_codec()
>>> mapper = fsspec.get_mapper(
...     'reference://', fo='temp.json', target_protocol='file')
>>> zarr.open(mapper, mode='r')
<zarr.core.Array (1, 2, 64, 64) float64 read-only>

In that case, using the TIFF decoder from the imagecodecs package is more efficient. See the earthbigdata.py example; a sketch follows below.
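A sketch of that whole-file-chunk route (the glob pattern is from the docstring above; chunkshape and dtype would come from reading one representative file, as in the earlier workaround):

import imagecodecs.numcodecs
import tifffile

# register imagecodecs' codecs (including 'imagecodecs_tiff') with numcodecs
imagecodecs.numcodecs.register_codecs()

# treat each TIFF file as one chunk and let libtiff decode it
with tifffile.TiffSequence('temp_C0*.tif').aszarr(
    chunkshape=chunkshape, dtype=dtype
) as store:
    store.write_fsspec('refs.json', url='file://', codec_id='imagecodecs_tiff')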

@RichardScottOZ

Thanks for that.

@RichardScottOZ commented May 14, 2022

OK, got this far

chunkshape = (testraster.shape[1],testraster.shape[2])
print("chunkshape:", chunkshape)
fillvalue = -1000

latitude_label = 'latitude'
latitude_coordinates = [y for y in testraster.y.values]
              
longitude_label = 'longitude'
longitude_coordinates = [x for x in testraster.x.values]

coordinates = {}
zarrgroup = zarr.open_group(coordinates)
zarrgroup.array(
    longitude_label, data=longitude_coordinates, dtype='float32'
).attrs['_ARRAY_DIMENSIONS'] = [longitude_label]

zarrgroup.array(
    latitude_label, data=latitude_coordinates, dtype='float32'
).attrs['_ARRAY_DIMENSIONS'] = [latitude_label]


coordinates_json = tifffile.ZarrStore._json(str(coordinates)).decode()

jsonfile.write(coordinates_json[:-2])  # skip the last newline and brace

#mode = 'test'  not needed here
fileseq = tifffile.TiffSequence(
    [file for file in pMine_files],
)


store = fileseq.aszarr(
    dtype='float32',
    chunkshape=chunkshape,
    fillvalue=fillvalue,
    axestiled={0: 0},
    zattrs={
        '_ARRAY_DIMENSIONS': [
            latitude_label,
            longitude_label,
        ]
    },
)
print(store)

store.write_fsspec(
    jsonfile,
    baseurl,
    groupname=mode,
    codec_id='imagecodecs_tiff',
    _append=True,
    _close=mode == 'lsmap',  # close after last store
)

jsonfile.write('\n}')
jsonfile.close()

chunkshape: (4710, 5087)
ZarrFileSequenceStore
 shape: 9420, 5087
 chunks: 4710, 5087
 dtype: float32
 fillvalue: -1000

@RichardScottOZ

Which seems good. There is a JSON formatting error now, which I bypassed with a string conversion in the interim to check that the above works.

However, returning coordinates_json = tifffile.ZarrStore._json(coordinates).decode()

gives a serialisation error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [70], in <cell line: 26>()
     17 zarrgroup.array(
     18     longitude_label, data=longitude_coordinates, dtype='float32'
     19 ).attrs['_ARRAY_DIMENSIONS'] = [longitude_label]
     21 zarrgroup.array(
     22     latitude_label, data=latitude_coordinates, dtype='float32'
     23 ).attrs['_ARRAY_DIMENSIONS'] = [latitude_label]
---> 26 coordinates_json = tifffile.ZarrStore._json(coordinates).decode()
     28 jsonfile.write(coordinates_json[:-2])  # skip the last newline and brace
     30 #mode = 'test'  not needed here

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\tifffile\tifffile.py:10256, in ZarrStore._json(obj)
  10253 @staticmethod
  10254 def _json(obj: Any, /) -> bytes:
  10255     """Serialize obj to a JSON formatted string."""
> 10256     return json.dumps(
  10257         obj,
  10258         indent=1,
  10259         sort_keys=True,
  10260         ensure_ascii=True,
  10261         separators=(',', ': '),
  10262     ).encode('ascii')

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\json\__init__.py:234, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    232 if cls is None:
    233     cls = JSONEncoder
--> 234 return cls(
    235     skipkeys=skipkeys, ensure_ascii=ensure_ascii,
    236     check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237     separators=separators, default=default, sort_keys=sort_keys,
    238     **kw).encode(obj)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\json\encoder.py:201, in JSONEncoder.encode(self, o)
    199 chunks = self.iterencode(o, _one_shot=True)
    200 if not isinstance(chunks, (list, tuple)):
--> 201     chunks = list(chunks)
    202 return ''.join(chunks)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\json\encoder.py:431, in _make_iterencode.<locals>._iterencode(o, _current_indent_level)
    429     yield from _iterencode_list(o, _current_indent_level)
    430 elif isinstance(o, dict):
--> 431     yield from _iterencode_dict(o, _current_indent_level)
    432 else:
    433     if markers is not None:

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\json\encoder.py:405, in _make_iterencode.<locals>._iterencode_dict(dct, _current_indent_level)
    403         else:
    404             chunks = _iterencode(value, _current_indent_level)
--> 405         yield from chunks
    406 if newline_indent is not None:
    407     _current_indent_level -= 1

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\json\encoder.py:438, in _make_iterencode.<locals>._iterencode(o, _current_indent_level)
    436         raise ValueError("Circular reference detected")
    437     markers[markerid] = o
--> 438 o = _default(o)
    439 yield from _iterencode(o, _current_indent_level)
    440 if markers is not None:

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\json\encoder.py:179, in JSONEncoder.default(self, o)
    160 def default(self, o):
    161     """Implement this method in a subclass such that it returns
    162     a serializable object for ``o``, or calls the base implementation
    163     (to raise a ``TypeError``).
   (...)
    177 
    178     """
--> 179     raise TypeError(f'Object of type {o.__class__.__name__} '
    180                     f'is not JSON serializable')

TypeError: Object of type bytes is not JSON serializable

@cgohlke (Owner) commented May 14, 2022

import base64

# base64 encode any values containing non-ascii characters
for k, v in coordinates.items():
    try:
        coordinates[k] = v.decode()
    except UnicodeDecodeError:
        coordinates[k] = 'base64:' + base64.b64encode(v).decode()
coordinates_json = tifffile.ZarrStore._json(coordinates).decode()

@RichardScottOZ

ah, thanks - missed that bit then, probably assuming it was ok.

@RichardScottOZ commented May 14, 2022

The testrasters look like this:
[image]

I get this error whether I include the band coordinate or not

".zgroup": "{\n    \"zarr_format\": 2\n}",
 "band/.zarray": "{\n    \"chunks\": [\n        1\n    ],\n    \"compressor\": {\n        \"blocksize\": 0,\n        \"clevel\": 5,\n        \"cname\": \"lz4\",\n        \"id\": \"blosc\",\n        \"shuffle\": 1\n    },\n    \"dtype\": \"<i4\",\n    \"fill_value\": 0,\n    \"filters\": null,\n    \"order\": \"C\",\n    \"shape\": [\n        1\n    ],\n    \"zarr_format\": 2\n}",
 "band/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"band\"\n    ]\n}",
 "band/0": "\u0002\u00013\u0004\u0004\u0000\u0000\u0000\u0004\u0000\u0000\u0000\u0014\u0000\u0000\u0000\u0001\u0000\u0000\u0000",
 "latitude/.zarray": "{\n    \"chunks\": [\n        4710\n    ],\n    \"compressor\": {\n        \"blocksize\": 0,\n        \"clevel\": 5,\n        \"cname\": \"lz4\",\n        \"id\": \"blosc\",\n        \"shuffle\": 1\n    },\n    \"dtype\": \"<f4\",\n    \"fill_value\": 0.0,\n    \"filters\": null,\n    \"order\": \"C\",\n    \"shape\": [\n        4710\n    ],\n    \"zarr_format\": 2\n}",
 "latitude/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"latitude\"\n    ]\n}",
 "latitude/0": 
...
"test/.zattrs": "{\n \"_ARRAY_DIMENSIONS\": [\n  \"band\",\n  \"latitude\",\n  \"longitude\"\n ]\n}",
 "test/.zarray": "{\n \"chunks\": [\n  1,\n  1,\n  4710,\n  5087\n ],\n \"compressor\": {\n  \"id\": \"imagecodecs_tiff\"\n },\n \"dtype\": \"<f4\",\n \"fill_value\": -1000.0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  2,\n  1,\n  4710,\n  5087\n ],\n \"zarr_format\": 2\n}",
 "test/0.0.0.0": ["https://generated/results-20201205/testrasters.tif"],
 "test/1.0.0.0": ["https://generated/results-20201206/testrasters.tif"]
}
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [127], in <cell line: 9>()
      1 mapper = fsspec.get_mapper(
      2     'reference://',
      3     fo='earthbigdata.json',
   (...)
      6     remote_options=r_opts,
      7 )
----> 9 dataset = xarray.open_dataset(
     10     mapper,
     11     engine='zarr',
     12     mask_and_scale=False,
     13     backend_kwargs={'consolidated': False},
     14 )
     15 print(dataset)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\api.py:495, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    483 decoders = _resolve_decoders_kwargs(
    484     decode_cf,
    485     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    491     decode_coords=decode_coords,
    492 )
    494 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495 backend_ds = backend.open_dataset(
    496     filename_or_obj,
    497     drop_variables=drop_variables,
    498     **decoders,
    499     **kwargs,
    500 )
    501 ds = _dataset_from_backend_dataset(
    502     backend_ds,
    503     filename_or_obj,
   (...)
    510     **kwargs,
    511 )
    512 return ds

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\zarr.py:814, in ZarrBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel)
    812 store_entrypoint = StoreBackendEntrypoint()
    813 with close_on_error(store):
--> 814     ds = store_entrypoint.open_dataset(
    815         store,
    816         mask_and_scale=mask_and_scale,
    817         decode_times=decode_times,
    818         concat_characters=concat_characters,
    819         decode_coords=decode_coords,
    820         drop_variables=drop_variables,
    821         use_cftime=use_cftime,
    822         decode_timedelta=decode_timedelta,
    823     )
    824 return ds

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\store.py:24, in StoreBackendEntrypoint.open_dataset(self, store, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
     12 def open_dataset(
     13     self,
     14     store,
   (...)
     22     decode_timedelta=None,
     23 ):
---> 24     vars, attrs = store.load()
     25     encoding = store.get_encoding()
     27     vars, attrs, coord_names = conventions.decode_cf_variables(
     28         vars,
     29         attrs,
   (...)
     36         decode_timedelta=decode_timedelta,
     37     )

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\common.py:123, in AbstractDataStore.load(self)
    101 def load(self):
    102     """
    103     This loads the variables and attributes simultaneously.
    104     A centralized loading function makes it easier to create
   (...)
    120     are requested, so care should be taken to make sure its fast.
    121     """
    122     variables = FrozenDict(
--> 123         (_decode_variable_name(k), v) for k, v in self.get_variables().items()
    124     )
    125     attributes = FrozenDict(self.get_attrs())
    126     return variables, attributes

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\zarr.py:422, in ZarrStore.get_variables(self)
    421 def get_variables(self):
--> 422     return FrozenDict(
    423         (k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
    424     )

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\utils.py:473, in FrozenDict(*args, **kwargs)
    472 def FrozenDict(*args, **kwargs) -> Frozen:
--> 473     return Frozen(dict(*args, **kwargs))

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\zarr.py:423, in <genexpr>(.0)
    421 def get_variables(self):
    422     return FrozenDict(
--> 423         (k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
    424     )

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\zarr.py:419, in ZarrStore.open_store_variable(self, name, zarr_array)
    416 if getattr(zarr_array, "fill_value") is not None:
    417     attributes["_FillValue"] = zarr_array.fill_value
--> 419 return Variable(dimensions, data, attributes, encoding)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\variable.py:305, in Variable.__init__(self, dims, data, attrs, encoding, fastpath)
    285 """
    286 Parameters
    287 ----------
   (...)
    302     unrecognized encoding items.
    303 """
    304 self._data = as_compatible_data(data, fastpath=fastpath)
--> 305 self._dims = self._parse_dimensions(dims)
    306 self._attrs = None
    307 self._encoding = None

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\variable.py:573, in Variable._parse_dimensions(self, dims)
    571 dims = tuple(dims)
    572 if len(dims) != self.ndim:
--> 573     raise ValueError(
    574         f"dimensions {dims} must have the same length as the "
    575         f"number of data dimensions, ndim={self.ndim}"
    576     )
    577 return dims

ValueError: dimensions ('latitude', 'longitude') must have the same length as the number of data dimensions, ndim=3

@RichardScottOZ

Ok, looks like it was keeping the first try - I'm trialling this in a notebook, so when I reran it was more up to date.

@RichardScottOZ

Trying with something useful:

[image]

So I presumably have something dimensionally wrong somewhere - or a protocol issue?

[image]

As it should look like Australia.

This is just plotting a copy of the first one of these I have locally:

[image]

The same via rioxarray directly from s3, with the same creds:

[image]

@RichardScottOZ

It seems I had to add another dimension to make it match and work, which isn't in Christoph's earthbigdata script, but I haven't worked out what I have done wrong yet. E.g. the first map looks like it is just all the fill value of -1000.

@RichardScottOZ commented May 14, 2022

Ok, looks like the autogenerated urls in the zarr are https; I changed my incorrect base url to s3 and all good! So it was the protocol.

[image]

@RichardScottOZ

So, those are small low-res files. How would you approach chunking generic GeoTIFFs this way? I had been converting things to zarr via xarray, which had useful defaults that worked ok.

@RichardScottOZ

> That's a very inefficient chunk. Probably better to treat individual files as chunks [...] In that case, using the TIFF decoder from the imagecodecs package is more efficient. See the earthbigdata.py example.

Have to decide what to do, chunk-wise, for a 40 GB GeoTIFF with 15 bands, for example.

@RichardScottOZ

> The 125 example is fine though. [image]

Although in this case I will need to add code afterwards to flip the map the right way round when plotting.

@RichardScottOZ

Or add coords?

@RichardScottOZ

So next question is approaching something like this:-

<xarray.Dataset>
Dimensions:    (band: 23, model_run: 1, y: 4710, x: 5087)
Coordinates:
  * band       (band) int32 1 2 3 4 5 6 7 8 9 10 ... 15 16 17 18 19 20 21 22 23
  * model_run  (model_run) int32 20201205
  * x          (x) float32 113.0 113.0 113.0 113.0 ... 154.0 154.0 154.0 154.0
  * y          (y) float32 -10.0 -10.01 -10.02 -10.03 ... -43.98 -43.99 -44.0
Data variables:
    pMine      (model_run, band, y, x) float32 ...

e.g. 23 different model varieties in one - this is too big for a 'one file' chunk. Depending how big they are, strip-sized chunks might be ok [if it is 50 GB], but probably not at the 2 GB level, e.g. one Sentinel band or the file example above.

@RichardScottOZ

Not a particularly useful thing to do, but a trial :-

chunkshape: (23, 4710, 5087)
['20201205']
ZarrFileSequenceStore
 shape: 1, 23, 4710, 5087
 chunks: 1, 23, 4710, 5087
 dtype: float32
 fillvalue: -1000

dataset['pMine'].max()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [11], in <cell line: 1>()
----> 1 dataset['pMine'].max()

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\common.py:58, in ImplementsArrayReduce._reduce_method.<locals>.wrapped_func(self, dim, axis, skipna, **kwargs)
     57 def wrapped_func(self, dim=None, axis=None, skipna=None, **kwargs):
---> 58     return self.reduce(func, dim, axis, skipna=skipna, **kwargs)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\dataarray.py:2696, in DataArray.reduce(self, func, dim, axis, keep_attrs, keepdims, **kwargs)
   2654 def reduce(
   2655     self,
   2656     func: Callable[..., Any],
   (...)
   2661     **kwargs: Any,
   2662 ) -> DataArray:
   2663     """Reduce this array by applying `func` along some dimension(s).
   2664 
   2665     Parameters
   (...)
   2693         summarized data and the indicated dimension(s) removed.
   2694     """
-> 2696     var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
   2697     return self._replace_maybe_drop_dims(var)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\variable.py:1806, in Variable.reduce(self, func, dim, axis, keep_attrs, keepdims, **kwargs)
   1804         data = func(self.data, axis=axis, **kwargs)
   1805     else:
-> 1806         data = func(self.data, **kwargs)
   1808 if getattr(data, "shape", ()) == self.shape:
   1809     dims = self.dims

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\variable.py:339, in Variable.data(self)
    337     return self._data
    338 else:
--> 339     return self.values

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\variable.py:512, in Variable.values(self)
    509 @property
    510 def values(self):
    511     """The variable's data as a numpy.ndarray"""
--> 512     return _as_array_or_item(self._data)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\variable.py:252, in _as_array_or_item(data)
    238 def _as_array_or_item(data):
    239     """Return the given values as a numpy array, or as an individual item if
    240     it's a 0d datetime64 or timedelta64 array.
    241 
   (...)
    250     TODO: remove this (replace with np.asarray) once these issues are fixed
    251     """
--> 252     data = np.asarray(data)
    253     if data.ndim == 0:
    254         if data.dtype.kind == "M":

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\indexing.py:552, in MemoryCachedArray.__array__(self, dtype)
    551 def __array__(self, dtype=None):
--> 552     self._ensure_cached()
    553     return np.asarray(self.array, dtype=dtype)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\indexing.py:549, in MemoryCachedArray._ensure_cached(self)
    547 def _ensure_cached(self):
    548     if not isinstance(self.array, NumpyIndexingAdapter):
--> 549         self.array = NumpyIndexingAdapter(np.asarray(self.array))

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\indexing.py:522, in CopyOnWriteArray.__array__(self, dtype)
    521 def __array__(self, dtype=None):
--> 522     return np.asarray(self.array, dtype=dtype)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\core\indexing.py:423, in LazilyIndexedArray.__array__(self, dtype)
    421 def __array__(self, dtype=None):
    422     array = as_indexable(self.array)
--> 423     return np.asarray(array[self.key], dtype=None)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\xarray\backends\zarr.py:73, in ZarrArrayWrapper.__getitem__(self, key)
     71 array = self.get_array()
     72 if isinstance(key, indexing.BasicIndexer):
---> 73     return array[key.tuple]
     74 elif isinstance(key, indexing.VectorizedIndexer):
     75     return array.vindex[
     76         indexing._arrayize_vectorized_indexer(key, self.shape).tuple
     77     ]

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\zarr\core.py:720, in Array.__getitem__(self, selection)
    718     result = self.vindex[selection]
    719 else:
--> 720     result = self.get_basic_selection(pure_selection, fields=fields)
    721 return result

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\zarr\core.py:846, in Array.get_basic_selection(self, selection, out, fields)
    843     return self._get_basic_selection_zd(selection=selection, out=out,
    844                                         fields=fields)
    845 else:
--> 846     return self._get_basic_selection_nd(selection=selection, out=out,
    847                                         fields=fields)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\zarr\core.py:889, in Array._get_basic_selection_nd(self, selection, out, fields)
    883 def _get_basic_selection_nd(self, selection, out=None, fields=None):
    884     # implementation of basic selection for array with at least one dimension
    885 
    886     # setup indexer
    887     indexer = BasicIndexer(selection, self)
--> 889     return self._get_selection(indexer=indexer, out=out, fields=fields)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\zarr\core.py:1179, in Array._get_selection(self, indexer, out, fields)
   1173 if not hasattr(self.chunk_store, "getitems") or \
   1174    any(map(lambda x: x == 0, self.shape)):
   1175     # sequentially get one key at a time from storage
   1176     for chunk_coords, chunk_selection, out_selection in indexer:
   1177 
   1178         # load chunk selection into output array
-> 1179         self._chunk_getitem(chunk_coords, chunk_selection, out, out_selection,
   1180                             drop_axes=indexer.drop_axes, fields=fields)
   1181 else:
   1182     # allow storage to get multiple items at once
   1183     lchunk_coords, lchunk_selection, lout_selection = zip(*indexer)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\zarr\core.py:1883, in Array._chunk_getitem(self, chunk_coords, chunk_selection, out, out_selection, drop_axes, fields)
   1880         out[out_selection] = fill_value
   1882 else:
-> 1883     self._process_chunk(out, cdata, chunk_selection, drop_axes,
   1884                         out_is_ndarray, fields, out_selection)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\zarr\core.py:1791, in Array._process_chunk(self, out, cdata, chunk_selection, drop_axes, out_is_ndarray, fields, out_selection, partial_read_decode)
   1789     if isinstance(cdata, PartialReadBuffer):
   1790         cdata = cdata.read_full()
-> 1791     self._compressor.decode(cdata, dest)
   1792 else:
   1793     chunk = ensure_ndarray(cdata).view(self._dtype)

File ~\AppData\Local\Continuum\anaconda3\envs\kerchunktest\lib\site-packages\imagecodecs\numcodecs.py:812, in Tiff.decode(self, buf, out)
    811 def decode(self, buf, out=None):
--> 812     return imagecodecs.tiff_decode(
    813         buf,
    814         index=self.index,
    815         asrgb=self.asrgb,
    816         verbose=self.verbose,
    817         out=out,
    818     )

File imagecodecs\_tiff.pyx:327, in imagecodecs._tiff.tiff_decode()

File imagecodecs\_shared.pyx:164, in imagecodecs._shared._create_array()

ValueError: invalid output shape (1, 23, 4710, 5087) (4710, 5087, 23)

@cgohlke (Owner) commented Jan 23, 2023

Looks like the dataset was removed:

>>> import fsspec
>>> fs = fsspec.filesystem('s3', anon=True, client_kwargs={'endpoint_url': 'https://mghp.osn.xsede.org'})
>>> fs.ls('/rsignellbucket1/lcmap/cog')
[]

Edit: Never mind. Works for me now.

@rsignell-usgs (Author)

@cgohlke, indeed, every once in a while it seems there is a glitch on the Open Storage Network. :(
