Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getitems: support meta_array #1131

Merged
merged 33 commits into from
Apr 13, 2023
Merged

Getitems: support meta_array #1131

merged 33 commits into from
Apr 13, 2023

Conversation

madsbk
Copy link
Contributor

@madsbk madsbk commented Sep 12, 2022

BaseStore now implements getitems(), which takes a meta_array argument. Besides simplifying the code, this enable stores to read directly to GPU memory: rapidsai/kvikio#131

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@lgtm-com
Copy link

lgtm-com bot commented Sep 12, 2022

This pull request introduces 1 alert when merging 26b7d2d into b677db3 - view on LGTM.com

new alerts:

  • 1 for Signature mismatch in overriding method

@codecov
Copy link

codecov bot commented Sep 12, 2022

Codecov Report

Merging #1131 (a1d3520) into main (4b0705c) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #1131   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           36        37    +1     
  Lines        14799     14855   +56     
=========================================
+ Hits         14799     14855   +56     
Impacted Files Coverage Δ
zarr/_storage/store.py 100.00% <100.00%> (ø)
zarr/context.py 100.00% <100.00%> (ø)
zarr/core.py 100.00% <100.00%> (ø)
zarr/storage.py 100.00% <100.00%> (ø)
zarr/tests/test_storage.py 100.00% <100.00%> (ø)
zarr/tests/test_storage_v3.py 100.00% <100.00%> (ø)
zarr/tests/test_util.py 100.00% <100.00%> (ø)
zarr/tests/util.py 100.00% <100.00%> (ø)
zarr/util.py 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes

@madsbk madsbk marked this pull request as ready for review September 13, 2022 11:16
@quasiben
Copy link

Is this the last PR needed by kvik for full Zarr support ?

@madsbk
Copy link
Contributor Author

madsbk commented Sep 21, 2022

Is this the last PR needed by kvik for full Zarr support ?

Yes with this PR and rapidsai/kvikio#131, all building blocks should be in place to use Zarr backed by CuPy arrays!

Copy link
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Mads! This is very helpful 🙏

Asked one question above about API design. Curious to hear yours and others thoughts 🙂

reads of multiple keys and/or to utilize the meta_array argument.
"""

return {k: self[k] for k in keys if k in self}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Know fsspec-based storage layers are using a dict return value, but am wondering if we should be doing something different (like returning an iterable). Asking since this would make the read blocking vs. a bit more lazy. The latter can be useful when working with operations that take a bit longer (like reading from the cloud or parallelizing several reads)

cc @martindurant (who also may have thoughts)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of it being blocking, is that many keys may be being fetched concurrently. If you make it an iterator, you lose that, unless you have something that can wait on an async iterator, which in turn has first-completed working.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with both of you.
The API specifies a return type of Mapping[str, Any] thus it is possible to return a lazy mapping that only reads self[k] when accessed. But let's do that in a follow up PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, and will be interested in how that looks.

By the way, I mentioned elsewhere the possibility of passing key-specific metadata to the get function; that would happen in this same signature. I wonder if you have any use for an array where only some of it is destined for the GPU (perhaps because that data already exists there and doesn't need loading at all).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example of what they would look like, @martindurant?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted above, think this discussion gets hairy enough it should be broken out into an issue (likely two) and discussed separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted above, think this discussion gets hairy enough it should be broken out into an issue (likely two) and discussed separately.

If we want to go the contexts route, we don't need a meta_array argument

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this also imply that getitem() should have a contexts parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this also imply that getitem() should have a contexts parameter?

Good point, yes getitem() should also take a contexts parameter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait a second, there is no getitem() and _chunk_getitem() anymore :)

@jakirkham
Copy link
Member

Is this the last PR needed by kvik for full Zarr support ?

Yes with this PR and rapidsai/kvikio#131, all building blocks should be in place to use Zarr backed by CuPy arrays!

Just to clarify, this is a quite useful optimization (and much appreciated!). Though is it a requirement (in that an error was hit somewhere)? Is there a bit more context about how this came up (particularly if an issue was encountered)?

@madsbk
Copy link
Contributor Author

madsbk commented Sep 26, 2022

Is this the last PR needed by kvik for full Zarr support ?

Yes with this PR and rapidsai/kvikio#131, all building blocks should be in place to use Zarr backed by CuPy arrays!

Just to clarify, this is a quite useful optimization (and much appreciated!). Though is it a requirement (in that an error was hit somewhere)? Is there a bit more context about how this came up (particularly if an issue was encountered)?

It is required in the sense that a backend doesn’t know the memory destination of a read without this PR. In KvikIO’s backend, we must rely on key names to infer if the data goes to host or device memory. E.g. the following code check for keys that should go to host memory always:

  if os.path.basename(fn) in [
      zarr.storage.array_meta_key,
      zarr.storage.group_meta_key,
      zarr.storage.attrs_key,
  ]:

This is a problem since a Store can define its own special keys with arbitrary names.

@jakirkham
Copy link
Member

Thanks for clarifying. Do we need setitems as well?

@madsbk
Copy link
Contributor Author

madsbk commented Sep 26, 2022

Thanks for clarifying. Do we need setitems as well?

I don't think it is as important as the getitems since a backend can always probe the input data and determine if it is host or device memory.
However, setitems would make it possible to implement concurrent write, which could benefit backends such as KvikIO significantly.

zarr/core.py Outdated Show resolved Hide resolved
@jakirkham
Copy link
Member

@joshmoore do you have thoughts on this one? 🙂

Copy link
Member

@joshmoore joshmoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments, @jakirkham. I do get the feeling that after the refactoring that enabled V3, we likely do need some storage-extension-specific documentation notes.

zarr/_storage/store.py Outdated Show resolved Hide resolved
zarr/_storage/store.py Outdated Show resolved Hide resolved
reads of multiple keys and/or to utilize the meta_array argument.
"""

return {k: self[k] for k in keys if k in self}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example of what they would look like, @martindurant?

@@ -1930,76 +1923,44 @@ def _process_chunk(
# store selected data in output
out[out_selection] = tmp

def _chunk_getitem(self, chunk_coords, chunk_selection, out, out_selection,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this so private that no one could have made use of it in a subclass?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, AFAICT.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's my understanding as well. Or at least by prefixing with _ we have warned users this is an implementation detail (subject to change) that they shouldn't rely on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakirkham: this pre-dates me, so I defer. But if it's not documented somewhere we might want to review if that holds across the board. For me, _x is typically valid for subclassing by developers, otherwise it would be a __x. (i.e. public, protected, private in Java-parlance)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a Python thing generally (not specific to Zarr).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single underscore if by far the most common thing to do, to show intent rather than enforce any privateness (which double underscore doesn't do either).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zarr/core.py Outdated
@@ -1257,20 +1257,13 @@ def _get_selection(self, indexer, out=None, fields=None):
else:
check_array_shape('out', out, out_shape)

# iterate over chunks
if not hasattr(self.chunk_store, "getitems") or \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ for removing hasattr hacks in general. 👍

madsbk and others added 2 commits October 7, 2022 15:27
Co-authored-by: Josh Moore <josh@openmicroscopy.org>
@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Mar 13, 2023
@madsbk
Copy link
Contributor Author

madsbk commented Mar 13, 2023

Resolved conflict.
I suggest we go with this approach for now and then unify get_partial_values() and getitems() later.

@rabernat
Copy link
Contributor

@joshmoore & @jakirkham - any objections to merging this as-is and deferring the broader discussions for future work?

@joshmoore
Copy link
Member

Thanks for the ping, @rabernat. I have to say I'm liking the typed Context better. I will read through again ASAP.

@martindurant
Copy link
Member

Notice the link to FITS reading: another example of per-chunk parameters that could be fixed by passing around a context, although it's not the only way to do it. I'd say that there are a number of efforts (some more and some less well developed) waiting to see what zarr will do.

@madsbk
Copy link
Contributor Author

madsbk commented Mar 30, 2023

@joshmoore, how can I help progress here? Maybe an online meeting?

@joshmoore
Copy link
Member

Sorry, @madsbk. My last comment was shortly before leaving on several weeks of travel and I lost of track of this. It wasn’t my intention to hold you up, and big 👍 for the ping. A few thoughts (from smallest to biggest) though I get the feeling that you are being ping-ponged back and forth between design issues:

  • If we're less than confident on this API, perhaps we should make this a keyword-only argument. That way the second position could potentially be used for something else and deprecation/renaming would be more straight-forward.
  • Unlike some of the context that was going to get passed, I have the feeling that the meta_array is more likely to be used for all chunks equally. Does that sound right? If so, I'm a bit hesitant to force someone to create dictionaries of N objects all with the same meta_array instance wrapped in new Context object. As I started to think about "wildcard" keys in the context argument or an additional default_context argument, I realized that perhaps we've not gotten this abstract right yet.
  • Finally, I've still not got a clear picture of what other fields on the Context object are going to help with the likes of partial read. Obviously, that's less to do your PR but just to callback to the original discussions about whether or not we knew what the API should look like.

@madsbk
Copy link
Contributor Author

madsbk commented Apr 11, 2023

If we're less than confident on this API, perhaps we should make this a keyword-only argument. That way the second position could potentially be used for something else and deprecation/renaming would be more straight-forward.

Agree, fixed.

Unlike some of the context that was going to get passed, I have the feeling that the meta_array is more likely to be used for all chunks equally. Does that sound right? If so, I'm a bit hesitant to force someone to create dictionaries of N objects all with the same meta_array instance wrapped in new Context object. As I started to think about "wildcard" keys in the context argument or an additional default_context argument, I realized that perhaps we've not gotten this abstract right yet.

I have added ConstantMap, which is a read-only map that maps all keys to the same constant value. This fixes the issue of having to create a Context for each key without complicating the semantic of getitems().

Finally, I've still not got a clear picture of what other fields on the Context object are going to help with the likes of partial read. Obviously, that's less to do your PR but just to callback to the original discussions about whether or not we knew what the API should look like.

I don't think any of us have a clear answer to this. Therefore, I suggest that we merge this PR and let people explore its possibilities. As long as Context is optional and we force Stores to implement a fallback if it doesn't exist, I think we are fine.

@rabernat rabernat merged commit b14f15f into zarr-developers:main Apr 13, 2023
jakirkham added a commit that referenced this pull request Apr 13, 2023
@madsbk
Copy link
Contributor Author

madsbk commented Apr 14, 2023

Thanks all!

@rabernat
Copy link
Contributor

Thank you for your patience Mads.

@madsbk madsbk mentioned this pull request Apr 18, 2023
@madsbk madsbk deleted the getitems branch April 20, 2023 09:48
rapids-bot bot pushed a commit to rapidsai/kvikio that referenced this pull request Jun 21, 2023
By using the new API in zarr-developers/zarr-python#1131, we do not have to guess whether to read into host or device memory. That is, no more filtering of specify keys like:
```python
 if os.path.basename(fn) in [ 
     zarr.storage.array_meta_key, 
     zarr.storage.group_meta_key, 
     zarr.storage.attrs_key, 
 ]: 
```

Notice, this PR is on hold until Zarr v2.15 is released 

Closes #119

UPDATE: Zarr v2.15 has been released

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Jordan Jacobelli (https://github.com/jjacobelli)

URL: #131
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs release notes Automatically applied to PRs which haven't added release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants