Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getitems: support meta_array #1131

Merged
merged 33 commits into from
Apr 13, 2023
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
Show all changes
33 commits
Select commit Hold shift + click to select a range
c268c63
Use _chunk_getitems() always
madsbk Sep 12, 2022
c7023f9
Implement getitems() always
madsbk Sep 12, 2022
89fa599
FSStore.getitems(): accept meta_array and on_error
madsbk Sep 12, 2022
f05ee3a
getitems(): handle on_error="omit"
madsbk Sep 12, 2022
0eed377
Removed the `on_error argument`
madsbk Sep 13, 2022
e578051
remove redundant check
madsbk Sep 13, 2022
03c97a8
getitems(): use Sequence instead of Iterable
madsbk Sep 13, 2022
8ba463c
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Sep 15, 2022
dfa731b
Merge branch 'main' into getitems
madsbk Sep 22, 2022
8e753d6
Merge branch 'main' into getitems
jakirkham Sep 23, 2022
f05edb1
Merge branch 'main' into getitems
jakirkham Sep 28, 2022
1d2b6ea
Typo
madsbk Oct 7, 2022
f87cb60
Merge branch 'main' into getitems
jakirkham Oct 7, 2022
eea7466
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Oct 11, 2022
05be1d4
Introduce a contexts argument
madsbk Oct 12, 2022
af54f7e
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Oct 12, 2022
5513d6f
CountingDict: impl. getitems()
madsbk Oct 12, 2022
ad46ccc
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Oct 12, 2022
40ecba0
added test_getitems()
madsbk Oct 12, 2022
0f224d4
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Oct 24, 2022
61d7a03
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Nov 4, 2022
81549f5
Introduce Context
madsbk Nov 4, 2022
2bfe68a
doc
madsbk Nov 4, 2022
4e9c39e
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Dec 6, 2022
a250f64
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Mar 13, 2023
02fc80d
support the new get_partial_values() method
madsbk Mar 13, 2023
61981f2
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Mar 13, 2023
d0afcde
Resolve conflict with get_partial_values()
madsbk Mar 13, 2023
7e01831
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Apr 11, 2023
d9838ef
make contexts keyword-only
madsbk Apr 11, 2023
c3ee95f
Introduce ConstantMap
madsbk Apr 11, 2023
ec5f396
use typing.Mapping
madsbk Apr 11, 2023
a1d3520
test_constant_map
madsbk Apr 11, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 29 additions & 1 deletion zarr/_storage/store.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,9 @@
import os
from collections.abc import MutableMapping
from string import ascii_letters, digits
from typing import Any, List, Mapping, Optional, Union
from typing import Any, Sequence, List, Mapping, Optional, Union

from numcodecs.ndarray_like import NDArrayLike

from zarr.meta import Metadata2, Metadata3
from zarr.util import normalize_storage_path
Expand Down Expand Up @@ -129,6 +131,32 @@ def _ensure_store(store: Any):
f"wrap it in Zarr.storage.KVStore. Got {store}"
)

def getitems(self, keys: Sequence[str], meta_array: NDArrayLike) -> Mapping[str, Any]:
"""Retrieve data from multiple keys.

Parameters
----------
keys : Iterable[str]
The keys to retrieve
meta_array : array-like
An array instance to use for determining the output type. For now, this is
only a hint and can be ignore by the implementation, in which case the type
madsbk marked this conversation as resolved.
Show resolved Hide resolved
of the output is the same as calling __getitem__() for each key in keys.

Returns
-------
Mapping
A collection mapping the input keys to their results.

Developer Notes
---------------
This default implementation use __getitem__() to read each key sequential and
madsbk marked this conversation as resolved.
Show resolved Hide resolved
ignores the meta_array argument. Overwrite this method to implement concurrent
reads of multiple keys and/or to utilize the meta_array argument.
"""

return {k: self[k] for k in keys if k in self}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Know fsspec-based storage layers are using a dict return value, but am wondering if we should be doing something different (like returning an iterable). Asking since this would make the read blocking vs. a bit more lazy. The latter can be useful when working with operations that take a bit longer (like reading from the cloud or parallelizing several reads)

cc @martindurant (who also may have thoughts)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of it being blocking, is that many keys may be being fetched concurrently. If you make it an iterator, you lose that, unless you have something that can wait on an async iterator, which in turn has first-completed working.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with both of you.
The API specifies a return type of Mapping[str, Any] thus it is possible to return a lazy mapping that only reads self[k] when accessed. But let's do that in a follow up PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, and will be interested in how that looks.

By the way, I mentioned elsewhere the possibility of passing key-specific metadata to the get function; that would happen in this same signature. I wonder if you have any use for an array where only some of it is destined for the GPU (perhaps because that data already exists there and doesn't need loading at all).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example of what they would look like, @martindurant?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted above, think this discussion gets hairy enough it should be broken out into an issue (likely two) and discussed separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted above, think this discussion gets hairy enough it should be broken out into an issue (likely two) and discussed separately.

If we want to go the contexts route, we don't need a meta_array argument

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this also imply that getitem() should have a contexts parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this also imply that getitem() should have a contexts parameter?

Good point, yes getitem() should also take a contexts parameter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait a second, there is no getitem() and _chunk_getitem() anymore :)



class Store(BaseStore):
"""Abstract store class used by implementations following the Zarr v2 spec.
Expand Down
81 changes: 21 additions & 60 deletions zarr/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -1257,20 +1257,13 @@ def _get_selection(self, indexer, out=None, fields=None):
else:
check_array_shape('out', out, out_shape)

# iterate over chunks
if not hasattr(self.chunk_store, "getitems") or \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ for removing hasattr hacks in general. 👍

any(map(lambda x: x == 0, self.shape)):
# sequentially get one key at a time from storage
for chunk_coords, chunk_selection, out_selection in indexer:

# load chunk selection into output array
self._chunk_getitem(chunk_coords, chunk_selection, out, out_selection,
drop_axes=indexer.drop_axes, fields=fields)
else:
# allow storage to get multiple items at once
if math.prod(out_shape) > 0:
# get chunks
lchunk_coords, lchunk_selection, lout_selection = zip(*indexer)
self._chunk_getitems(lchunk_coords, lchunk_selection, out, lout_selection,
drop_axes=indexer.drop_axes, fields=fields)
self._chunk_getitems(
lchunk_coords, lchunk_selection, out, lout_selection,
drop_axes=indexer.drop_axes, fields=fields
)

if out.shape:
return out
Expand Down Expand Up @@ -1930,76 +1923,44 @@ def _process_chunk(
# store selected data in output
out[out_selection] = tmp

def _chunk_getitem(self, chunk_coords, chunk_selection, out, out_selection,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this so private that no one could have made use of it in a subclass?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, AFAICT.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's my understanding as well. Or at least by prefixing with _ we have warned users this is an implementation detail (subject to change) that they shouldn't rely on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakirkham: this pre-dates me, so I defer. But if it's not documented somewhere we might want to review if that holds across the board. For me, _x is typically valid for subclassing by developers, otherwise it would be a __x. (i.e. public, protected, private in Java-parlance)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a Python thing generally (not specific to Zarr).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single underscore if by far the most common thing to do, to show intent rather than enforce any privateness (which double underscore doesn't do either).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop_axes=None, fields=None):
"""Obtain part or whole of a chunk.
def _chunk_getitems(self, lchunk_coords, lchunk_selection, out, lout_selection,
drop_axes=None, fields=None):
"""Obtain part or whole of chunks.

Parameters
----------
chunk_coords : tuple of ints
Indices of the chunk.
chunk_selection : selection
Location of region within the chunk to extract.
chunk_coords : list of tuple of ints
Indices of the chunks.
chunk_selection : list of selections
Location of region within the chunks to extract.
out : ndarray
Array to store result in.
out_selection : selection
Location of region within output array to store results in.
out_selection : list of selections
Location of regions within output array to store results in.
drop_axes : tuple of ints
Axes to squeeze out of the chunk.
fields
TODO

"""
out_is_ndarray = True
try:
out = ensure_ndarray_like(out)
except TypeError:
out_is_ndarray = False

assert len(chunk_coords) == len(self._cdata_shape)

# obtain key for chunk
ckey = self._chunk_key(chunk_coords)

try:
# obtain compressed data for chunk
cdata = self.chunk_store[ckey]

except KeyError:
# chunk not initialized
if self._fill_value is not None:
if fields:
fill_value = self._fill_value[fields]
else:
fill_value = self._fill_value
out[out_selection] = fill_value

else:
self._process_chunk(out, cdata, chunk_selection, drop_axes,
out_is_ndarray, fields, out_selection)

def _chunk_getitems(self, lchunk_coords, lchunk_selection, out, lout_selection,
drop_axes=None, fields=None):
"""As _chunk_getitem, but for lists of chunks

This gets called where the storage supports ``getitems``, so that
it can decide how to fetch the keys, allowing concurrency.
"""
out_is_ndarray = True
try:
out = ensure_ndarray_like(out)
except TypeError: # pragma: no cover
out_is_ndarray = False

# Keys to retrieve
ckeys = [self._chunk_key(ch) for ch in lchunk_coords]

partial_read_decode = False
# Check if we can do a partial read
if (
self._partial_decompress
and self._compressor
and self._compressor.codec_id == "blosc"
and hasattr(self._compressor, "decode_partial")
and not fields
and self.dtype != object
and hasattr(self.chunk_store, "getitems")
):
partial_read_decode = True
cdatas = {
Expand All @@ -2008,8 +1969,8 @@ def _chunk_getitems(self, lchunk_coords, lchunk_selection, out, lout_selection,
if ckey in self.chunk_store
}
else:
partial_read_decode = False
cdatas = self.chunk_store.getitems(ckeys, on_error="omit")
cdatas = self.chunk_store.getitems(ckeys, meta_array=self._meta_array)
madsbk marked this conversation as resolved.
Show resolved Hide resolved

for ckey, chunk_select, out_select in zip(ckeys, lchunk_selection, lout_selection):
if ckey in cdatas:
self._process_chunk(
Expand Down
8 changes: 6 additions & 2 deletions zarr/storage.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@
from os import scandir
from pickle import PicklingError
from threading import Lock, RLock
from typing import Optional, Union, List, Tuple, Dict, Any
from typing import Sequence, Mapping, Optional, Union, List, Tuple, Dict, Any
import uuid
import time

Expand All @@ -41,6 +41,7 @@
ensure_text,
ensure_contiguous_ndarray_like
)
from numcodecs.ndarray_like import NDArrayLike
from numcodecs.registry import codec_registry

from zarr.errors import (
Expand Down Expand Up @@ -1363,7 +1364,10 @@ def _normalize_key(self, key):

return key.lower() if self.normalize_keys else key

def getitems(self, keys, **kwargs):
def getitems(
self, keys: Sequence[str], meta_array: NDArrayLike
) -> Mapping[str, Any]:

keys_transformed = [self._normalize_key(key) for key in keys]
results = self.map.getitems(keys_transformed, on_error="omit")
# The function calling this method may not recognize the transformed keys
Expand Down