Codecs without array metadata #1632
Conversation
zarr/v3/metadata.py
Outdated
@@ -154,20 +163,19 @@ class ShardingCodecIndexLocation(Enum):
 @frozen
-class CoreArrayMetadata:
+class ChunkMetadata:
     shape: ChunkCoords
I'm thinking about the name for this class:

- The "chunk" part: none of the attributes of this class are specific to chunks per se. If we aliased `chunk_shape` with `shape` and `data_type` with `dtype`, then we would have a generic specification of an NDArray-like that wouldn't look out of place in numpy. So what if we drop the "chunk" references in the name and attributes? The old name was `CoreArrayMetadata`. "Core array metadata" feels like a better description than "chunk metadata", but it's a bit long, and I don't think we need to semantically distinguish between "core" arrays and "non-core" arrays (i.e., chunks) in code like this.
- The "metadata" part: I'm not sure that I would call the shape and dtype of an array "metadata" in the conventional sense -- "property" seems like a better fit, but I'm not sure. And we are already in a module called `metadata`, so repeating that word in class names might be redundant. What do you think of moving from the `<Thing>Metadata` pattern to `<Thing>`, as long as the "metadata-ness" is conveyed by the module name?

A couple of alternative names for this class, off the top of my head:

- `ArraySpec`
- `ArrayInfo`
- `ArrayProps`
- `NDArray` (I kind of prefer this one, as long as we namespace it in the right module to make things unambiguous)
I would find `NDArray` confusing, because it is not an array but only contains information about an array. `ArraySpec` or `ArrayInfo` are fine with me.
let's just go with `ArraySpec`, and if people hate it we can find a better name at that time?
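For concreteness, here is a minimal sketch of what a frozen `ArraySpec` record could look like. This is illustrative only: the `fill_value` attribute and the `ndim` property are assumptions, not taken from this PR.

```python
from dataclasses import dataclass
from typing import Any, Tuple

import numpy as np

# Hypothetical sketch: an immutable record describing an in-memory
# array (chunk or otherwise) by shape, dtype, and fill value.
@dataclass(frozen=True)
class ArraySpec:
    shape: Tuple[int, ...]
    dtype: np.dtype
    fill_value: Any

    @property
    def ndim(self) -> int:
        # number of dimensions, derived from the shape
        return len(self.shape)
```

Because the same spec object can describe every chunk of a regularly chunked array, one instance can be shared across the whole codec pipeline.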
zarr/v3/metadata.py
Outdated
-        return CoreArrayMetadata(
-            shape=self.shape,
+    def get_chunk_metadata(
+        self, _chunk_coords: ChunkCoords, runtime_configuration: RuntimeConfiguration
Neither `_chunk_coords` nor `runtime_configuration` get used here. I'm guessing that this function takes `_chunk_coords` as an argument in anticipation of irregular chunking, but what's the use for `runtime_configuration`? (nvm, the GH UI didn't show the full return value.) Maybe a docstring could help here.
> I'm guessing that this function takes `_chunk_coords` as an argument in anticipation of irregular chunking

Exactly! We can leave it out for now.
zarr/v3/codecs/transpose.py
Outdated
     for x, i in enumerate(self.order):
         inverse_order[x] = i
     chunk_array = chunk_array.transpose(inverse_order)
     return chunk_array

 async def encode(
-        self,
-        chunk_array: np.ndarray,
+        self, chunk_array: np.ndarray, chunk_metadata: ChunkMetadata
Is it important that this method operate on chunks, as opposed to arrays (of which chunks are a subset)? If we want these functions to be agnostic to the semantics of chunking, and scoped more broadly to transforming arrays / bytes, then maybe renaming the `chunk_array` and `chunk_metadata` parameters to `array` and `array_metadata` would make sense?
`transpose` is an `ArrayArrayCodec`; that is why the arguments are called `chunk_array`, in contrast to `chunk_bytes`.
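As a rough illustration of the transpose round trip, here is a sketch using standalone functions rather than the codec class from this PR; `np.argsort` is used to invert the permutation in place of the loop shown in the diff.

```python
import numpy as np

# Encoding applies the configured axis permutation; decoding applies
# the inverse permutation to recover the original axis layout.
def encode_transpose(chunk_array: np.ndarray, order) -> np.ndarray:
    return chunk_array.transpose(order)

def decode_transpose(chunk_array: np.ndarray, order) -> np.ndarray:
    inverse_order = np.argsort(order)  # inverse of the permutation
    return chunk_array.transpose(inverse_order)
```

Since both functions only permute axes of an ndarray, nothing here is chunk-specific, which is the point raised in the comment above.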
zarr/v3/codecs/transpose.py
Outdated
     ) -> Optional[np.ndarray]:
         chunk_array = chunk_array.transpose(self.order)
         return chunk_array

-    def compute_encoded_size(self, input_byte_length: int) -> int:
+    def compute_encoded_size(self, input_byte_length: int, _chunk_metadata: ChunkMetadata) -> int:
Thoughts on `compute_encoded_size` vs `encoded_size`, or `get_encoded_size`?
All fine with me. I just copied it straight from the implementation recommendations in the spec.
Am I right in thinking that… Can we be sure that…

attrs calls it…

Very likely, there will be additional required arguments in the future. I already declared…

ok! I'm fine with…

Another approach that could keep a constant function signature would be to use a single keyword argument… One broad question: what do you think about using…

Wouldn't that weaken the type "safety"?

In zarrita, I only used…

Does the…

It is the other way around: …

In what way? I had a think about this, and a fundamental issue is that if the…

Aha, sorry for misreading you. In that case… If…
zarr/v3/codecs/__init__.py
Outdated
     ) -> Tuple[ArrayBytesCodec, ArraySpec]:
         for codec, chunk_metadata in codecs:
             if isinstance(codec, ArrayBytesCodec):
                 return (codec, chunk_metadata)
@normanrz - can you comment on whether we have a check elsewhere to make sure we never have two `ArrayBytesCodec`s in the array metadata object?
zarr/v3/codecs/__init__.py
Outdated
         chunk_metadata: ArraySpec,
         runtime_configuration: RuntimeConfiguration,
     ) -> np.ndarray:
         codecs = list(self._codecs_with_resolved_metadata(chunk_metadata))[::-1]
do you think it would be easier to reason about this if we split the codec pipeline into three groups:

- `BytesBytesCodec`s
- `BytesArrayCodec`s
- `ArrayArrayCodec`s

That would cut down on the number of instance checks later on.
alternatively, could we do a tagged union by adding a `codec_type` attribute with type `Literal["BytesBytes", "BytesArray", "ArrayArray"]`? Then we just need one class definition and attribute checks instead of `isinstance`. (edit: we probably need 3 classes, because the function signatures of `encode`/`decode` are different, but I do think putting a `codec_type` attribute on those classes gives us discrimination that's a bit cleaner than `isinstance` checks)
ah, my proposal is actually not an alternative to @jhamman's idea. i think we could do both of these things.
Worth noting that the proposal to partition the codecs by type actually brings us back to the design of hdf5 / v2, where there is a separate filter pipeline (filters basically being `ArrayArray` transformations).
Not sure it makes anything more readable: b825b4f
> alternatively, could we do a tagged union by adding a `codec_type` attribute with type `Literal["BytesBytes", "BytesArray", "ArrayArray"]`? Then we just need one class definition and attribute checks instead of `isinstance`. (edit: we probably need 3 classes, because the function signatures of `encode`/`decode` are different, but I do think putting a `codec_type` attribute on those classes gives us discrimination that's a bit cleaner than `isinstance` checks)
I think `isinstance` actually works pretty well here (from a developer and mypy point of view). Why would an attribute check be better?
An attribute check would be better if we wanted to enable developers to use codecs without requiring that they inherit from our base classes (we could also achieve this with structural subtyping). And `isinstance(x, ArrayArrayCodec)` is a teeny bit more work to read / write than `x.codec_type == "ArrayArray"`. That being said, I don't think either of these complaints should block your efforts here, so feel free to ignore.
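To make the trade-off concrete, here is a sketch of the attribute-based discrimination. The codec class names are hypothetical; the `Literal` values follow the comment above.

```python
from typing import Literal

CodecType = Literal["BytesBytes", "BytesArray", "ArrayArray"]

# Hypothetical codecs tagged with a codec_type attribute. No shared
# base class is required for callers to tell the kinds apart.
class TransposeLikeCodec:
    codec_type: CodecType = "ArrayArray"

class GzipLikeCodec:
    codec_type: CodecType = "BytesBytes"

def is_array_array(codec) -> bool:
    # structural check: works for any object carrying the attribute
    return getattr(codec, "codec_type", None) == "ArrayArray"
```

The attribute check admits third-party codecs that never import zarr's base classes, at the cost of losing the narrowing that mypy performs on `isinstance`.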
zarr/v3/codecs/__init__.py
Outdated
         if chunk_bytes_maybe is None:
             return None
         chunk_bytes = chunk_bytes_maybe

         return chunk_bytes

-    def compute_encoded_size(self, byte_length: int) -> int:
-        return reduce(lambda acc, codec: codec.compute_encoded_size(acc), self.codecs, byte_length)
+    async def encode_partial(
What would the consequences be of removing `encode`, renaming `encode_partial` to `encode`, and using exceptions for codecs that can't do partial encoding (i.e., for these codecs, if `selection` is anything other than a full selection, raise an exception, which the caller can handle)? Logically, `encode_partial` and `decode_partial` are just more general versions of `encode` and `decode`, so it feels safe to rely on this fact in the API design.
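A sketch of the exception-based design being proposed; all names here are hypothetical, and the byte-reversal stands in for a real encoding step.

```python
class PartialEncodeUnsupported(Exception):
    """Raised by codecs that can only encode a full chunk."""

class FullChunkOnlyCodec:
    def encode(self, chunk_bytes: bytes, selection=None) -> bytes:
        # selection=None means "the full chunk"; anything else is partial
        if selection is not None:
            raise PartialEncodeUnsupported
        return chunk_bytes[::-1]  # stand-in for a real transformation

def write_chunk(codec, chunk_bytes: bytes, selection=None) -> bytes:
    try:
        return codec.encode(chunk_bytes, selection)
    except PartialEncodeUnsupported:
        # caller's fallback: read-modify-write the whole chunk instead
        return codec.encode(chunk_bytes, None)
```

The caller keeps a single code path and only pays the full-chunk cost when a codec opts out of partial encoding.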
I think `encode` and `encode_partial` are different enough to keep them separate. `encode_partial` has a number of limitations, the main one being that it currently only works with a single codec. Also, `encode_partial` takes a reference to the store to load the data itself; `encode` takes a buffer.
@normanrz - a quick rebase and this should be good to go.
Co-authored-by: Joe Hamman <jhamman1@gmail.com>
Force-pushed from c503994 to b6a6d41
This PR removes the `array_metadata` attribute from the codecs. That makes it possible to construct codec classes that can be reused for multiple arrays. It is kind of a precursor for #1623.

Instead, there is now a `chunk_metadata` argument passed to the `encode` and `decode` methods. This chunk metadata can be the same object for all chunks, which is valid for regular chunk grids. When we have variable chunking in the future, each chunk can have its own metadata passed through the codec pipeline.

There are 2 new methods on the Codec interface:

- `validate` validates a codec configuration against array metadata. Consequently, it can only be called after the array metadata has been constructed (which also contains a list of codecs). For example, `validate` can be used to check that the number of dimensions in the `order` attribute of the `transpose` configuration matches the number of dimensions in the array. It cannot modify the codec configuration, but it can raise exceptions.
- `evolve` can be used to create a new codec with a "fixed" configuration. For example, the `blosc` codec has a `typesize` attribute, which can trivially be determined once the dtype of the array is known. It would be poor UX to require the user to set this explicitly when constructing the codec. The `evolve` method can perform this modification. It only has 2 arguments for now, `ndim` and `data_type`. It is called before constructing the array metadata, so we can't pass that in. I don't think this is the best interface we can come up with. Ideas welcome.

TODO: