Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add take specification for returning elements of an array along a specified axis #416

Merged
merged 10 commits into from
Nov 16, 2022

Conversation

kgryte
Copy link
Contributor

@kgryte kgryte commented Apr 7, 2022

This PR

  • adds a specification for take, an API for returning elements of an array along a specified axis. This API was initially proposed in Proposal: add APIs for getting and setting elements via a list of indices (i.e., take, put, etc) #177.
  • supports only a single keyword argument: axis. This is derived from comparison data.
  • requires that the axis keyword argument always be provided, as suggested in comment, but only when the input array has more than one dimension.
  • does not currently support providing a multi-dimensional array of indices. Array libraries supporting multi-dimensional arrays for indices treat the input array as a flattened 1-dimensional array. If we were to follow similarly, we'd need to specify the flattened layout.
  • does not currently support lists/tuples/sequences for the indices argument. NumPy et al allow array-like. Torch requires LongTensor. Was not immediately clear to me how accepting we should be for non-arrays, so went with the most conservative choice.
  • does not support an integer for the indices argument. Instead, users should use normal slice/indexing operations.

Questions

  • Should the API be extended to support multi-dimensional indices? No, not yet.
  • Should the axis argument be required? Yes, but only for multi-dimensional x.
  • Should the indices argument be more liberal in the types of allowed arguments? No, should be restricted to only array values.

@kgryte kgryte added API extension Adds new functions or objects to the API. topic: Indexing Array indexing. labels Apr 7, 2022
@kgryte kgryte requested a review from asmeurer April 7, 2022 09:53
@honno
Copy link
Member

honno commented Apr 7, 2022

Should this PR be blocked until an initial spec release?

@asmeurer
Copy link
Member

asmeurer commented Apr 7, 2022

Regarding axis and flattening, it's worth noting np.put doesn't have axis, so requiring it here would either break the symmetry with a potential put function, or require put to be modified from NumPy (see #177 (comment)).

@shoyer
Copy link
Contributor

shoyer commented Apr 7, 2022

Two thoughts:

  1. We could make axis optional only in the case of 1D arrays
  2. I would not necessarily be limited by constraints on np.put, which in my opinion is much more niche functionality than take. I don't even know if we want put in the API standard.

@kgryte
Copy link
Contributor Author

kgryte commented Apr 18, 2022

@rgommers Before moving forward with this PR, we need to ensure buy-in from PyTorch. The current proposal deviates from supported PyTorch behavior in the following ways:

  1. PyTorch does not currently support an axis kwarg.
  2. PyTorch always flattens the input array.

We'd need to ensure that PyTorch is (1) okay with adding an axis kwarg to support taking along a specified axis and (2) making an axis kwarg mandatory, thus breaking current behavior (i.e., based on the current proposal, no way to index into a flattened array using take which is the current PyTorch behavior).

For NumPy-inspired libraries, the axis kwarg is supported and, provided the spec-compliant API is in a sub-namespace, xp.take(x, indices, axis=-1) can simply wrap a call to np.take. Dask already effectively requires an axis kwarg, as the default axis value is 0.

Accordingly, PyTorch is the odd one out here, atm, so would be good to get a feel for potential specification constraints in order to ensure broad API compatibility, while also addressing fancy indexing concerns (indexing arrays via arrays) and the proposal to require an axis.

@kgryte kgryte added this to the v2021 milestone Apr 21, 2022
@kgryte kgryte modified the milestones: v2021, v2022 May 2, 2022
@rgommers
Copy link
Member

rgommers commented Jul 6, 2022

For NumPy-inspired libraries, the axis kwarg is supported and, provided the spec-compliant API is in a sub-namespace, xp.take(x, indices, axis=-1) can simply wrap a call to np.take. Dask already effectively requires an axis kwarg, as the default axis value is 0.

This requires a backwards-incompatible change in NumPy as well, when we want to move to support in the main namespace. It'd be a good change to make though. Dask not following the NumPy default is fairly exceptional for Dask, and was probably done because the flattening behavior is indeed not useful (EDIT: it was added in dask/dask#26, no discussion there on axis default).

I just had a look at SciPy, and it has quite a few np.take usages but none with the default axis=None. Scikit-learn does have some, I need to look in a little more detail to verify whether those are all for 1-D arrays or not.

@rgommers
Copy link
Member

rgommers commented Jul 7, 2022

I looked at all usages in scikit-learn and most inputs to np.take are 1-D, in which case the axis argument is not used/needed, and for the one case with 2-D array input the code uses axis=0:

# `processed` is 1-D
sklearn/cluster/_optics.py:    unproc = np.compress(~np.take(processed, indices), indices)
--
# `X` is 2-D, that's why this uses `axis=0`
sklearn/cluster/_optics.py:        dists = pairwise_distances(P, np.take(X, unproc, axis=0),
--
# `reachability_` is 1-D
sklearn/cluster/_optics.py:    improved = np.where(rdists < np.take(reachability_, unproc))
--
# `compute_class_weight` returns a 1-D array
sklearn/utils/class_weight.py:            weight_k = np.take(compute_class_weight(class_weight_k,
sklearn/utils/class_weight.py-                                                    classes=classes_subsample,
sklearn/utils/class_weight.py-                                                    y=y_subsample),
sklearn/utils/class_weight.py-                               np.searchsorted(classes_subsample,
sklearn/utils/class_weight.py-                                               classes_full),
sklearn/utils/class_weight.py-                               mode='clip')
--
# `labels_multiclass` and `labels_binary` are lists of strings (so 1-D)
sklearn/utils/estimator_checks.py:    y_names_multiclass = np.take(labels_multiclass, y_multiclass)
sklearn/utils/estimator_checks.py:    y_names_binary = np.take(labels_binary, y_binary)
sklearn/utils/estimator_checks.py:    y_names_binary = np.take(labels_binary, y_binary)
--
# `nominal_attributes` is a dict, so first argument to `take` is a 1-D object array
sklearn/datasets/_openml.py:                    np.take(
sklearn/datasets/_openml.py-                        np.asarray(nominal_attributes.pop(col_name),
sklearn/datasets/_openml.py-                                   dtype='O'),
sklearn/datasets/_openml.py-                        y[:, i:i + 1].astype(int, copy=False))
--
# `perm` is 1-D
sklearn/model_selection/tests/test_split.py:        y_perm = np.take(perm, y)
--
# `perm` is 1-D
sklearn/metrics/tests/test_common.py:        y_true_perm = np.take(perm, y_true)
--
# `labels_multiclass` is a list of strings (so 1-D)
sklearn/semi_supervised/tests/test_self_training.py:    y_strings = np.take(labels_multiclass, y)

@rgommers
Copy link
Member

rgommers commented Jul 7, 2022

does not currently support lists/tuples/sequences for the indices argument. NumPy et al allow array-like. Torch requires LongTensor. Was not immediately clear to me how accepting we should be for non-arrays, so went with the most conservative choice.

As a rule we don't allow array-like inputs in the array API standard, because (a) it's ill-defined, and (b) it turns out not to be the best idea to allow it anyway. So it's fine to be consistent with that here.

We could make axis optional only in the case of 1D arrays

This makes sense to me. And seems useful, based on my analysis of how scikit-learn uses np.take.

The current proposal deviates from supported PyTorch behavior in the following ways:

1. PyTorch does not currently support an `axis` kwarg.

2. PyTorch always flattens the input array.

We'd need to ensure that PyTorch is (1) okay with adding an axis kwarg to support taking along a specified axis and (2) making an axis kwarg mandatory, thus breaking current behavior (i.e., based on the current proposal, no way to index into a flattened array using take which is the current PyTorch behavior).

I don't think there's an issue with progressing this PR:

  • torch.take with two arguments is actually compatible with numpy.take with two arguments,
  • adding an axis keyword is not controversial for PyTorch, and would be fully backwards compatible if it were to default to None,
  • the introduction path would add axis=None (or better, dim=None) first, and then have the appropriate warnings raised before switching the default to 0,
  • the bc-breaking change would be to add axis=0; if as discussed above the axis keyword is mandatory for >=2-D array input, then there's still no risk of silent breakage, just some code churn. And this would be the same for PyTorch as for NumPy,
  • I'll note that this will be low-prio for PyTorch to work on though, because there's already take_along_dim that does what we propose here (again similar to np.take_along_axis), so it's just some churn but doesn't add functionality.

Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kgryte, overall looks good - a few comments.

One formatting comment: please wrap lines at 80 characters, that's standard in Python and far easier to read/review.

spec/API_specification/signatures/indexing_functions.py Outdated Show resolved Hide resolved
spec/API_specification/signatures/indexing_functions.py Outdated Show resolved Hide resolved
spec/API_specification/signatures/indexing_functions.py Outdated Show resolved Hide resolved
@seberg
Copy link
Contributor

seberg commented Jul 7, 2022

We could make axis optional only in the case of 1D arrays

Seems OK, since it is compatible with None as a default. But, is there much reason for it as in that case arr[index] is practically identical (unless you need to ensure a copy and index is a scalar)?

(Plausibly, in a few cases sklearn uses it because many years ago take was faster and either the myth or the code stuck around. One of the example uses mode="clip" though.)

I am not immediately sure whether 0 or -1 would make the better default. None is useful for reduce-likes, but take is maybe not quite reduce-like enough (but probably inherits it from there).

take_along_dim that does what we propose here (again similar to np.take_along_axis), so it's just some churn but doesn't add functionality.

take_along_axis is a very different functionality, it is only identical for 1-D inputs. take_along_axis specificalaly takes a arr.ndim dimensional index array as input.

@rgommers
Copy link
Member

rgommers commented Jul 7, 2022

Seems OK, since it is compatible with None as a default. But, is there much reason for it as in that case arr[index] is practically identical (unless you need to ensure a copy and index is a scalar)?

Except integer array indexing is not in the standard (yet at least), which is why there was so much interest in adding take.

take_along_axis is a very different functionality,

Good point, I overlooked that.

@kgryte
Copy link
Contributor Author

kgryte commented Nov 3, 2022

As discussed in the Consortium meeting on 22 September 2022, consensus was to limit the indices argument to one-dimensional arrays.

For now, the API will not support multi-dimensional arrays for the indices kwarg. In the interim, array API consumers should be able to use reshape to achieve something similar. If, based on feedback and usage, multi-dimensional indices support is desired, we can revisit the specification and update accordingly.

This PR has been updated to reflect current consensus.

@kgryte
Copy link
Contributor Author

kgryte commented Nov 3, 2022

As discussed in the 3 November 2022 meeting, no objections were raised. This PR should be ready for merge.

@rgommers would you mind taking a last look and then approving?

@rgommers
Copy link
Member

I went through the whole discussion here again, and it looks like we're all happy and there aren't any remaining blockers. This wasn't an easy one! In it goes, thanks all!

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API extension Adds new functions or objects to the API. topic: Indexing Array indexing.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants