Codecs without array metadata #1632
Conversation
zarr/v3/metadata.py
Outdated
@@ -154,20 +163,19 @@ class ShardingCodecIndexLocation(Enum):
 @frozen
-class CoreArrayMetadata:
+class ChunkMetadata:
     shape: ChunkCoords
I'm thinking about the name for this class:

- The "chunk" part: none of the attributes of this class are specific to chunks per se. If we aliased `chunk_shape` with `shape` and `data_type` with `dtype`, then we would have a generic specification of an NDArray-like that wouldn't look out of place in numpy. So what if we drop the "chunk" references in the name and attributes? The old name was `CoreArrayMetadata`. "Core array metadata" feels like a better description than "chunk metadata", but it's a bit long, and I don't think we need to semantically distinguish between "core" arrays and "non-core" arrays (i.e., chunks) in code like this.
- The "metadata" part: I'm not sure that I would call the shape and dtype of an array "metadata" in the conventional sense -- "property" seems like a better fit, but I'm not sure. And we are already in a module called `metadata`, so repeating that word in class names might be redundant. What do you think of moving from the `<Thing>Metadata` pattern to `<Thing>`, as long as the "metadata-ness" is conveyed by the module name?

A couple of alternative names for this class, off the top of my head:

- `ArraySpec`
- `ArrayInfo`
- `ArrayProps`
- `NDArray` (I kind of prefer this one, as long as we namespace it in the right module to make things unambiguous)
I would find `NDArray` confusing, because it is not an array but only contains information about an array. `ArraySpec` or `ArrayInfo` are fine with me.
let's just go with `ArraySpec`, and if people hate it we can find a better name at that time?
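For concreteness, here is a minimal sketch of what a frozen `ArraySpec` record could look like. This is illustrative only: the `fill_value` attribute and the `ndim` property are assumptions, not taken from this PR.

```python
from dataclasses import dataclass
from typing import Any, Tuple

import numpy as np

# Hypothetical sketch: an immutable record describing an in-memory
# array (chunk or otherwise) by shape, dtype, and fill value.
@dataclass(frozen=True)
class ArraySpec:
    shape: Tuple[int, ...]
    dtype: np.dtype
    fill_value: Any

    @property
    def ndim(self) -> int:
        # number of dimensions, derived from the shape
        return len(self.shape)
```

Because the same spec object can describe every chunk of a regularly chunked array, one instance can be shared across the whole codec pipeline.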
zarr/v3/metadata.py
Outdated
-        return CoreArrayMetadata(
-            shape=self.shape,
+    def get_chunk_metadata(
+        self, _chunk_coords: ChunkCoords, runtime_configuration: RuntimeConfiguration
Neither `_chunk_coords` nor `runtime_configuration` get used here. I'm guessing that this function takes `_chunk_coords` as an argument in anticipation of irregular chunking, but what's the use for `runtime_configuration`? (nvm, the GH UI didn't show the full return value.) Maybe a docstring could help here.
> I'm guessing that this function takes `_chunk_coords` as an argument in anticipation of irregular chunking

Exactly! We can leave it out for now.
zarr/v3/codecs/transpose.py
Outdated
     for x, i in enumerate(self.order):
         inverse_order[x] = i
     chunk_array = chunk_array.transpose(inverse_order)
     return chunk_array

 async def encode(
-        self,
-        chunk_array: np.ndarray,
+        self, chunk_array: np.ndarray, chunk_metadata: ChunkMetadata
Is it important that this method operate on chunks, as opposed to arrays (of which chunks are a subset)? If we want these functions to be agnostic to the semantics of chunking, and scoped more broadly to transforming arrays / bytes, then maybe renaming the `chunk_array` and `chunk_metadata` parameters to `array` and `array_metadata` would make sense?
`transpose` is an `ArrayArrayCodec`; that is why the arguments are called `chunk_array`, in contrast to `chunk_bytes`.
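As a rough illustration of the transpose round trip, here is a sketch using standalone functions rather than the codec class from this PR; `np.argsort` is used to invert the permutation in place of the loop shown in the diff.

```python
import numpy as np

# Encoding applies the configured axis permutation; decoding applies
# the inverse permutation to recover the original axis layout.
def encode_transpose(chunk_array: np.ndarray, order) -> np.ndarray:
    return chunk_array.transpose(order)

def decode_transpose(chunk_array: np.ndarray, order) -> np.ndarray:
    inverse_order = np.argsort(order)  # inverse of the permutation
    return chunk_array.transpose(inverse_order)
```

Since both functions only permute axes of an ndarray, nothing here is chunk-specific, which is the point raised in the comment above.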
zarr/v3/codecs/transpose.py
Outdated
     ) -> Optional[np.ndarray]:
         chunk_array = chunk_array.transpose(self.order)
         return chunk_array

-    def compute_encoded_size(self, input_byte_length: int) -> int:
+    def compute_encoded_size(self, input_byte_length: int, _chunk_metadata: ChunkMetadata) -> int:
Thoughts on `compute_encoded_size` vs `encoded_size`, or `get_encoded_size`?
All fine with me. I just copied it straight from the implementation recommendations in the spec.
Am I right in thinking that… Can we be sure that…

attrs calls it…

Very likely, there will be additional required arguments in the future. I already declared…

ok! I'm fine with…

Another approach that could keep a constant function signature would be to use a single keyword argument… One broad question: what do you think about using…

Wouldn't that weaken the type "safety"?

In zarrita, I only used…

Does the…

It is the other way around: …

In what way? I had a think about this, and a fundamental issue is that if the…

Aha, sorry for misreading you. In that case… If…
zarr/v3/codecs/__init__.py
Outdated
     ) -> Tuple[ArrayBytesCodec, ArraySpec]:
         for codec, chunk_metadata in codecs:
             if isinstance(codec, ArrayBytesCodec):
                 return (codec, chunk_metadata)
@normanrz - can you comment on whether we have a check elsewhere to make sure we never have two `ArrayBytesCodec`s in the array metadata object?
zarr/v3/codecs/__init__.py
Outdated
         chunk_metadata: ArraySpec,
         runtime_configuration: RuntimeConfiguration,
     ) -> np.ndarray:
         codecs = list(self._codecs_with_resolved_metadata(chunk_metadata))[::-1]
do you think it would be easier to reason about this if we split the codec pipeline into three groups:

- `BytesBytesCodec`s
- `BytesArrayCodec`s
- `ArrayArrayCodec`s

That would cut down on the number of instance checks later on.
alternatively, could we do a tagged union by adding a `codec_type` attribute with type `Literal["BytesBytes", "BytesArray", "ArrayArray"]`? Then we just need one class definition and attribute checks instead of `isinstance`. (edit: we probably need 3 classes, because the function signatures of `encode`/`decode` are different, but I do think putting a `codec_type` attribute on those classes gives us discrimination that's a bit cleaner than `isinstance` checks)
ah, my proposal is actually not an alternative to @jhamman's idea. i think we could do both of these things.
Worth noting that the proposal to partition the codecs by type actually brings us back to the design of hdf5 / v2, where there is a separate filter pipeline (filters basically being `ArrayArray` transformations).
Not sure it makes anything more readable: b825b4f
> alternatively, could we do a tagged union by adding a `codec_type` attribute with type `Literal["BytesBytes", "BytesArray", "ArrayArray"]`? Then we just need one class definition and attribute checks instead of `isinstance`. (edit: we probably need 3 classes, because the function signatures of `encode`/`decode` are different, but I do think putting a `codec_type` attribute on those classes gives us discrimination that's a bit cleaner than `isinstance` checks)
I think `isinstance` actually works pretty well here (from a developer and mypy point of view). Why would an attribute check be better?
An attribute check would be better if we wanted to enable developers to use codecs without requiring that they inherit from our base classes (we could also achieve this with structural subtyping). And `isinstance(x, ArrayArrayCodec)` is a teeny bit more work to read / write than `x.codec_type == "ArrayArray"`. That being said, I don't think either of these complaints should block your efforts here, so feel free to ignore.
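To make the trade-off concrete, here is a sketch of the attribute-based discrimination. The codec class names are hypothetical; the `Literal` values follow the comment above.

```python
from typing import Literal

CodecType = Literal["BytesBytes", "BytesArray", "ArrayArray"]

# Hypothetical codecs tagged with a codec_type attribute. No shared
# base class is required for callers to tell the kinds apart.
class TransposeLikeCodec:
    codec_type: CodecType = "ArrayArray"

class GzipLikeCodec:
    codec_type: CodecType = "BytesBytes"

def is_array_array(codec) -> bool:
    # structural check: works for any object carrying the attribute
    return getattr(codec, "codec_type", None) == "ArrayArray"
```

The attribute check admits third-party codecs that never import zarr's base classes, at the cost of losing the narrowing that mypy performs on `isinstance`.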
zarr/v3/codecs/__init__.py
Outdated
         if chunk_bytes_maybe is None:
             return None
         chunk_bytes = chunk_bytes_maybe

         return chunk_bytes

-    def compute_encoded_size(self, byte_length: int) -> int:
-        return reduce(lambda acc, codec: codec.compute_encoded_size(acc), self.codecs, byte_length)
+    async def encode_partial(
What would the consequences be of removing `encode`, renaming `encode_partial` to `encode`, and using exceptions for codecs that can't do partial encoding (i.e., for these codecs, if `selection` is anything other than a full selection, raise an exception, which the caller can handle)? Logically, `encode_partial` and `decode_partial` are just more general versions of `encode` and `decode`, so it feels safe to rely on this fact in the API design.
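A sketch of the exception-based design being proposed; all names here are hypothetical, and the byte-reversal stands in for a real encoding step.

```python
class PartialEncodeUnsupported(Exception):
    """Raised by codecs that can only encode a full chunk."""

class FullChunkOnlyCodec:
    def encode(self, chunk_bytes: bytes, selection=None) -> bytes:
        # selection=None means "the full chunk"; anything else is partial
        if selection is not None:
            raise PartialEncodeUnsupported
        return chunk_bytes[::-1]  # stand-in for a real transformation

def write_chunk(codec, chunk_bytes: bytes, selection=None) -> bytes:
    try:
        return codec.encode(chunk_bytes, selection)
    except PartialEncodeUnsupported:
        # caller's fallback: read-modify-write the whole chunk instead
        return codec.encode(chunk_bytes, None)
```

The caller keeps a single code path and only pays the full-chunk cost when a codec opts out of partial encoding.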
I think `encode` and `encode_partial` are different enough to keep them separate. `encode_partial` has a number of limitations, the main one being that it currently only works with a single codec. Also, `encode_partial` takes a reference to the store to load the data itself; `encode` takes a buffer.
@normanrz - a quick rebase and this should be good to go.
Co-authored-by: Joe Hamman <jhamman1@gmail.com>
Force-pushed from c503994 to b6a6d41
This PR removes the `array_metadata` attribute from the codecs. That makes it possible to construct codec classes that can be reused for multiple arrays. It is kind of a precursor for #1623.

Instead, there is now a `chunk_metadata` argument passed to the `encode` and `decode` methods. This chunk metadata can be the same object for all chunks, which is valid for regular chunk grids. When we have variable chunking in the future, each chunk can have its own metadata passed through the codec pipeline.

There are 2 new methods on the Codec interface:

- `validate` validates a codec configuration against array metadata. Consequently, it can only be called after the array metadata has been constructed (which also contains a list of codecs). For example, `validate` can be used to check that the number of dimensions in the `order` attribute of the `transpose` configuration matches the number of dimensions in the array. It cannot modify the codec configuration, but it can raise exceptions.
- `evolve` can be used to create a new codec with a "fixed" configuration. For example, the `blosc` codec has a `typesize` attribute, which can trivially be determined once the dtype of the array is known. It would be poor UX to require the user to set this explicitly when constructing the codec. The `evolve` method can perform this modification. It only has 2 arguments for now, `ndim` and `data_type`. It is called before constructing the array metadata, so we can't pass that in. I don't think this is the best interface we can come up with. Ideas welcome.

TODO: