Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EA: support basic 2D operations #27142

Closed
wants to merge 36 commits into from

Conversation

jbrockmendel
Copy link
Member

This is one of (I think) two things we will actually require of EA authors. The other is we will need something like view to return a new EA object with a view on the same data. That will be implemented in a separate PR following discussion.

cc @jreback @jorisvandenbossche @TomAugspurger

pandas/core/arrays/datetimelike.py Outdated Show resolved Hide resolved
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add a test that asserts shape is defined
and that len is also reasonable (if it’s overridden)

@TomAugspurger
Copy link
Contributor

Remind me, why is this necessary? Is this for the reshape mixin? This will be a breaking change for EA authors.

@jbrockmendel
Copy link
Member Author

Remind me, why is this necessary? Is this for the reshape mixin? This will be a breaking change for EA authors.

Yes. It is a breaking change, but it is also much more robust than the other options (e.g. metaclass) which I found to be fragile MRO-wise.

@jbrockmendel
Copy link
Member Author

Updated to include parts of #27153, the upshot of which is that this now includes all the expected breaking changes for EA authors.

  • EAs must implement view that returns a new object viewing the same data
  • EAs that don't natively support 2D need to apply a implement_2d decorator.
    • Unlike in the original version of this PR, they do NOT need to implement shape, can stick with __len__.

Other breaking changes that aren't EA-author-specific:

  • Categorical.ravel() now returns Categorical instead of ndarray
  • DTA.view(), TDA.view(), PA.view() return DTA/TDA/PA instead of ndarray

@jreback
Copy link
Contributor

jreback commented Jul 2, 2019

can you update the title of the PR to better reflect things

from pandas._libs.lib import is_integer


def implement_2d(cls):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the reasoning behind making this a decorator as opposed to just having base class method? it seems much simpler and more obvious.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main pushback on this idea was the onus put on EA authors who only support 1D. ATM this decorator mainly just renames __len__ to size so that it can re-define shape in a 2D-compatible way. So asking authors to use the decorator instead of doing that re-definition themselves is kind of a wash.

But the step after this is to patch __getitem__, __setitem__, take, __iter__, all of which are easier to do with the decorator than by asking authors to do it themselves.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the plan still to have an "opt me out of this, I natively support 2D arrays"? If we go down the route of putting every Block.values inside an EA (with PandasArray for the current NumPy-backed Blocks), then we'll want that, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see you're applying this to subclasses, rather than ExtensionArray itself. Motivation for that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Motivation for that?

less magic than a metaclass (and my first attempt at using a metaclass failed)

Is the plan still to have an "opt me out of this, I natively support 2D arrays"?

Defined EA._allows_2d = False. Authors set that to True if they implement this themselves. This decorator should be updated to check that and be a no-op in that case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right, because simply decorating

@implements_2d
class ExtensionArray:

would work for subclasses right? It has to be a metaclass. I'll poke around at the meta class approach to see what's going on. May be too magical.

But if we do go with a

@jbrockmendel jbrockmendel changed the title EA: require EA to define shape instead of __len__ EA: support basic 2D operations Jul 2, 2019
@TomAugspurger
Copy link
Contributor

@jbrockmendel is this the PR to focus on for 2D EA stuff?

Is there anything you definitely want to have in the release / release candidate?

@jbrockmendel
Copy link
Member Author

is this the PR to focus on for 2D EA stuff?

Yes.

Is there anything you definitely want to have in the release / release candidate?

If we're going to do a deprecation cycle for the behavior of Categorical.ravel, I'd like to get that in.

@jorisvandenbossche
Copy link
Member

is this the PR to focus on for 2D EA stuff?

Yes.

I am a bit lost in the different PRs / POC you made.
Can you explain how this relates to the other PRs? How does this differ and why did you decide to take this path (compared to the other PRs).

@jbrockmendel
Copy link
Member Author

I am a bit lost in the different PRs / POC you made.

Fair enough. Quick summary:

How does this differ and why did you decide to take this path (compared to the other PRs).

The approach for this PR centers around two goals:

  • Keep the scope limited (self-explanatory)
  • Minimize what we ask of downstream authors

We can patch most of the relevant methods of a 1D-only EA using a decorator, here implemented as implement_2d (note ATM this decorator only patches shape, len, size. In follow-ups it will also patch take, getitem, setitem).

One thing that AFAICT we can't patch and do need the EA authors to implement is view (I'm open to other names) to get a new object with the same underlying data. This is needed for reshape, T, and ravel.

Finally, the _allows_2d attribute is there so authors can say "No, don't patch anything, I've already implemented 2D support".

@TomAugspurger
Copy link
Contributor

@jreback @jbrockmendel are you free for a call this afternoon? I'm free any time after ~1:00 central.

@jreback
Copy link
Contributor

jreback commented Jul 23, 2019 via email

@jbrockmendel
Copy link
Member Author

Today after 4 EST works for me.

@TomAugspurger
Copy link
Contributor

Let's do 4:00 EST / 3:00 Central / 1:00 Pacific then. meet.google.com/bxs-pjaf-oqi

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looking through the patched methods.

pandas/core/arrays/_reshaping.py Show resolved Hide resolved

def __iter__(self):
if self.ndim == 1:
for obj in orig_iter(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does just return orig_iter(self) work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC from one of the proofs-of-concept this actually mattered, but I dont remember which direction it mattered in

for obj in orig_iter(self):
yield obj
else:
for n in range(self.shape[0]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this case be written as something like

reduced = self[:, 0]
return orig_iter(reduced)`

? I worry a bit about the cost of calling __getitem__ on each iteration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely dont want to call orig_iter here since we want to yield ndim-1 arrays

from pandas._libs.lib import is_integer


def implement_2d(cls):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the plan still to have an "opt me out of this, I natively support 2D arrays"? If we go down the route of putting every Block.values inside an EA (with PandasArray for the current NumPy-backed Blocks), then we'll want that, right?


key = expand_key(key, self.shape)
if is_integer(key[0]):
assert key[0] in [0, -1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this at a glance. Who does it have to be in [0, -1]? (I get that we're only allowing N x 1 and 1 X N arrays, but I don't get why key[0] is the one being checked).

Just from looking at it, I would expect this to break this __getitem__ on something like

arr = pd.array([1, 2, 3], dtype="Int64")
# arr = arra.reshape(...)
arr[1]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I wrote this with only (1, N) in mind

"""
if len(shape) == 1:
return True
if len(shape) > 2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this come up in practice? I'm having a hard time judging whether NotImplementedError or False is the right behavior in this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think NotImplementedError is more accurate, but for the current implementation this should never be reached

def tuplify_shape(size: int, shape, restrict=True) -> Tuple[int, ...]:
"""
Convert a passed shape into a valid tuple.
Following ndarray.reshape, we accept either `reshape(a, b)` or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't suppose NumPy has a function we can borrow here? I'm not aware of one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numpy/numpy#13768 is the closest thing I found, but I agree it seems like the kind of thing that should exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a somewhat dumb approach:

def tuplify_shape(size, shape):
    return np.broadcast_to(None, size).reshape(*shape).shape

Tested as:

In [20]: tuplify_shape(100, (2, -1, 5))
Out[20]: (2, 10, 5)

In [21]: tuplify_shape(100, ((2, -1, 5)))
Out[21]: (2, 10, 5)

In [22]: tuplify_shape(100, ((2, 11, 5)))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-f71a2f488ea8> in <module>()
----> 1 tuplify_shape(100, ((2, 11, 5)))

<ipython-input-19-6b4139eccdff> in tuplify_shape(size, shape)
      1 def tuplify_shape(size, shape):
----> 2     return np.broadcast_to(None, size).reshape(*shape).shape

ValueError: cannot reshape array of size 100 into shape (2,11,5)

-----
- This must return a *new* object, not self.
- The only case that *must* be implemented is with dtype=None,
giving a view with the same dtype as self.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can / should we make any requirements on this being no-copy? i.e.

a = my_array(...)
b = a.view(dtype=int)
b[0] = 10
assert a[0] == b[0]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely needs to be no-copy; i'll add that to the docstring

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a test for the no-copy bit

@@ -932,6 +977,26 @@ def _formatting_values(self) -> np.ndarray:
# Reshaping
# ------------------------------------------------------------------------

def reshape(self, *shape):
"""
Return a view on this array with the given shape.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to document that only (1, N) and (N, 1) is allowed here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thinking. ATM I wrote most of this with only (1, N) in mind

from pandas._libs.lib import is_integer


def implement_2d(cls):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see you're applying this to subclasses, rather than ExtensionArray itself. Motivation for that?

@jreback
Copy link
Contributor

jreback commented Jul 23, 2019

@TomAugspurger what is the google link?

@jorisvandenbossche
Copy link
Member

Did you have a call? If so, is there a summary of what has been discussed?

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche yes we did, but nothing written AFAIK. The main takeaway I got is that jreback is on board contingent on there being zero additional (required) complexity for downstream authors.

@TomAugspurger
Copy link
Contributor

Which essentially means a either

  1. metaclass to patch take, getitem, etc. at import time.
  2. monkeypatch or something at runtime when the array gets to internals.

My preference is for 1 if it's doable.

@@ -888,6 +888,7 @@ Other API changes
- :meth:`ExtensionArray.argsort` places NA values at the end of the sorted array. (:issue:`21801`)
- :meth:`DataFrame.to_hdf` and :meth:`Series.to_hdf` will now raise a ``NotImplementedError`` when saving a :class:`MultiIndex` with extention data types for a ``fixed`` format. (:issue:`7775`)
- Passing duplicate ``names`` in :meth:`read_csv` will now raise a ``ValueError`` (:issue:`17346`)
- :meth:`Categorical.ravel` will now return a :class:`Categorical` instead of a NumPy array. (:issue:`27153`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this change in 0.25? (IIRC yes?), if not can you break it out

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we only recently deprecated the old behavior. will double-check and see whats appropriate


cls.__iter__ = __iter__

return cls
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we are still doing this approach, but if we are

I would break out each of the implementations into a well named function that takes cls. then impelemented_2d is pretty straightforward to read w/o having to understand the details, you can immediately see what is being changed and the details exist in the functions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

ATM the relevant discussion about how to proceed is in Tom's PR here

@jbrockmendel
Copy link
Member Author

Closing pending #28389.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants