Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Identify numpy-like zero-dimensional arrays as non-iterable #35131

Closed
znicholls opened this issue Jul 5, 2020 · 2 comments · Fixed by #44626
Closed

ENH: Identify numpy-like zero-dimensional arrays as non-iterable #35131

znicholls opened this issue Jul 5, 2020 · 2 comments · Fixed by #44626
Labels
Enhancement Internals Related to non-user accessible pandas implementation
Milestone

Comments

@znicholls
Copy link
Contributor

Note: I wasn't sure whether to label this as an enhancement or bug. I've put enhancement as I don't think the original implementation was actually incorrect, rather just limited in scope. I'm happy to change to bug if that's more appropriate.

Is your feature request related to a problem?

I'm one of the developers of Pint-Pandas, which provides an extension type for storing numeric data with units (and, once mature, may solve #10349). The problem we're having is that we can't pass all the extension type tests. In particular, we fail many tests due to Panda's current implementation of is_list_like.

As @hgrecco shows here, pandas determines whether something is_list_like here (code copied below).

cdef inline bint c_is_list_like(object obj, bint allow_sets) except -1:
    return (
        isinstance(obj, abc.Iterable)
        # we do not count strings/unicode/bytes as list-like
        and not isinstance(obj, (str, bytes))
        # exclude zero-dimensional numpy arrays, effectively scalars
        and not (util.is_array(obj) and obj.ndim == 0)
        # exclude sets if allow_sets is False
        and not (allow_sets is False and isinstance(obj, abc.Set))
    )

At the moment, all Pint Quantity instances have an __iter__ method. It is designed to raise a TypeError if the quantity isn't actually iterable (code here).

The problem is that the __iter__ method is defined, regardless of whether the Quantity instance is actually iterable or not. This is a problem because it means that, in all cases, isinstance(quantity_instance, abc.Iterable) returns True (even if quantity_instance isn't actually iterable and would return a TypeError as soon as one tried to iterate with it).

A numpy array suffers from the same issue which is where the not (util.is_array(obj) and obj.ndim == 0) in Panda's is_list_like is important, because it excludes scalars. The problem is that util.is_array(obj) returns False if obj is anything other than np.ndarray. As a result, we can't just add an ndim attribute to Quantity because the obj.ndim == 0 check will never be evaluated anyway.

Describe the solution you'd like

Update is_list_like so that it correctly identifies implementations of scalars in numpy-like arrays (I think these are called 'duck numpy arrays'?). I've put a draft in #35127 and would be grateful for feedback on the proposed solution there. If there's a better solution, I'm also happy to implement that instead.

API breaking implications

is_list_like would now return False for scalars of numpy-like array implementations. If we used the implementation in #35127 then any object that has an ndim attribute and has obj.ndim == 0 would no longer be identified as list-like.

Describe alternatives you've considered

We've discussed how we could get the behaviour we want by changing Pint, rather than changing Pandas, but it seems to be impossible if we want to maintain the numpy-like implementation without separate scalar and array classes (see discussion here).

Additional context

>>> import numpy as np
>>> import pandas as pd
>>> import pint
>>> from collections import abc

>>> numpy_scalar = np.array(2)
>>> numpy_scalar
array(2)
# numpy 0d-array (i.e. scalar) quantities appear to be iterable even though they're not
>>> isinstance(numpy_scalar, abc.Iterable)
True
>>> numpy_scalar[0]
...
IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed

# however, numpy 0d-array quantities are correctly handled by Pandas
>>> pd.core.dtypes.inference.is_list_like(numpy_scalar)
False

# compare this with Pint's numpy-like implementation
>>> ureg = pint.UnitRegistry()
>>> quantity_scalar = 3 * ureg.meter
>>> quantity_scalar
<Quantity(3, 'meter')>

# similar to a numpy array, scalar quantities appear to be iterable even though they're not
>>> isinstance(quantity_scalar, abc.Iterable)
True
>>> quantity_scalar[0]  
...
TypeError: Neither Quantity object nor its magnitude (3)supports indexing

# however, Pandas identifies Pint's implementation as being iterable which is what is causing
# our problem
>>> pd.core.dtypes.inference.is_list_like(quantity_scalar)
True  # we would like this to be False
@jorisvandenbossche
Copy link
Member

@znicholls Thanks for the clearly described issue!

We are having a similar problem in GeoPandas where the scalars (eg MultiPolygon) can be potentially iterable as well / can have a length (although here they are actually iterable, so not raising an error, making it even harder to detect this). xref #26333, #27911

So in those referenced issues, it is the length-check that is problematic. However in the case of pint, it might be easier to solve with the suggest ndim check.
One question here, though, the ndim check will only work with pint scalars that wrap a numpy scalar, I think? (so not for 3 * ureg.meter, but only for np.int64(3) * ureg.meter or (np.array([3] * ureg.meter)[0])

@jorisvandenbossche jorisvandenbossche removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jul 6, 2020
@znicholls
Copy link
Contributor Author

One question here, though, the ndim check will only work with pint scalars that wrap a numpy scalar, I think?

Correct. To make this work, I think we'd have to add an ndim property to Pint's Quantity class that returned sensible values for all types (so we'd need custom behaviour to make sure e.g. quantity.ndim returns 0 even if you're wrapping a 'normal scalar' rather than a numpy scalar).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Internals Related to non-user accessible pandas implementation
Projects
None yet
4 participants