WIP/POC: Allow (N, 1) and (1,N) EAs #26914

jbrockmendel · 2019-06-18T01:30:21Z

Discussed on the mailing list last week, cc @TomAugspurger @jreback @jorisvandenbossche

The central claim: if we allow EAs to take on shapes (N, 1) and (1, N), then we can get some major simplifications in core.internals.

For the proof of concept I've made the Datetimelike arrays accept these shapes and updated DatetimeTZBlock to see what simplifications are available. The _wrap_data implementation would also work for Categorical and PandasArray.

Three categories of simplifications are not in here yet and you'll have to use your imaginations

some can only be made once all EAs allow these shapes, e.g. not having to pass ndim arg to Block constructors
some would require additional tweaks on the EA subclasses, e.g. making shift, __iter__, or take aware of 2Dness
some I just haven't got to yet. core.internals is rough.

Added bonus: this fleshes out bugs (see #26864)

Minimal follow-on edits to keep tests passing

codecov · 2019-06-18T04:01:05Z

Codecov Report

Merging #26914 into master will decrease coverage by 0.01%.
The diff coverage is 92.56%.

@@            Coverage Diff             @@
##           master   #26914      +/-   ##
==========================================
- Coverage   91.87%   91.85%   -0.02%     
==========================================
  Files         180      180              
  Lines       50712    50792      +80     
==========================================
+ Hits        46590    46657      +67     
- Misses       4122     4135      +13

Flag	Coverage Δ
#multiple	`90.45% <92.56%> (-0.01%)`	⬇️
#single	`41.12% <50.41%> (-0.06%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/arrays/datetimelike.py	`97.95% <100%> (+0.02%)`	⬆️
pandas/core/arrays/timedeltas.py	`88.82% <100%> (ø)`	⬆️
pandas/core/arrays/datetimes.py	`97.79% <100%> (ø)`	⬆️
pandas/core/groupby/ops.py	`96% <100%> (ø)`	⬆️
pandas/core/dtypes/concat.py	`96.6% <100%> (+0.01%)`	⬆️
pandas/core/internals/managers.py	`95.21% <100%> (ø)`	⬆️
pandas/core/internals/concat.py	`96.51% <100%> (+0.02%)`	⬆️
pandas/io/formats/format.py	`97.8% <83.33%> (-0.11%)`	⬇️
pandas/core/internals/blocks.py	`94.25% <90.9%> (-0.13%)`	⬇️
pandas/core/arrays/base.py	`98.52% <92%> (-0.92%)`	⬇️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update baa77c3...99931d4. Read the comment docs.

codecov · 2019-06-18T04:01:07Z

Codecov Report

Merging #26914 into master will decrease coverage by <.01%.
The diff coverage is 94.95%.

@@            Coverage Diff             @@
##           master   #26914      +/-   ##
==========================================
- Coverage   91.97%   91.97%   -0.01%     
==========================================
  Files         180      180              
  Lines       50756    50833      +77     
==========================================
+ Hits        46685    46753      +68     
- Misses       4071     4080       +9

Flag	Coverage Δ
#multiple	`90.57% <94.95%> (ø)`	⬆️
#single	`41.84% <51.26%> (-0.07%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/dtypes/concat.py	`96.58% <ø> (ø)`	⬆️
pandas/core/arrays/datetimelike.py	`97.95% <100%> (+0.03%)`	⬆️
pandas/core/arrays/timedeltas.py	`88.82% <100%> (ø)`	⬆️
pandas/core/arrays/datetimes.py	`97.81% <100%> (+0.01%)`	⬆️
pandas/core/groupby/ops.py	`96% <100%> (ø)`	⬆️
pandas/core/internals/managers.py	`96% <100%> (ø)`	⬆️
pandas/core/internals/concat.py	`96.51% <100%> (+0.02%)`	⬆️
pandas/io/formats/format.py	`97.8% <83.33%> (-0.11%)`	⬇️
pandas/core/arrays/base.py	`98.52% <92%> (-0.92%)`	⬇️
pandas/core/internals/blocks.py	`94.42% <94.91%> (+0.04%)`	⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b9b081d...06c4544. Read the comment docs.

TomAugspurger · 2019-06-18T13:40:08Z

What have you learned from starting the proof of concept? Do you think this is worth pursing?

jbrockmendel · 2019-06-18T16:20:51Z

What have you learned from starting the proof of concept?

Good question. In no particular order from my notes:

Implementing ReshapeMixin is straighforward
Patching any given EA method for the 2D cases is easy, but there are a bunch of them.
There are a lot of workarounds for the 1D/2D mix scattered through the code, not just in core.internals. This both means that a) it will be some work to track them all down and b) the simplification we get will not be limited to core.internals.
groupby and reshape code both reach into internals more than I expected/hoped.
Many try/except blocks in core.internals wrap a lot in the try so it isn't clear what failure modes are being caught.
We probably don't need both Block.astype and Block._astype, since the former just calls the latter.
indexing._NDFrameIndexer._setitem_with_indexer needs to be refactored into smaller chunks
super().foo doesn't work in the interpreter the same way super(whatever, self).foo did

Do you think this is worth pursing?

Yes. As described above, I think the complexity of internals is hiding a good amount of unintended behavior that this would help alleviate.

I'd like to find a nice way to break this into smaller pieces. Something like:

Implement+test all the 2D behavior for DTA/TDA/PA
Implement+test all the 2D behavior for Categorical
Implement+test all the 2D behavior for SparseArray (xref BUG: DataFrame[Sparse] quantile fails because SparseArray has no reshape #24600)
Implement+test all the 2D behavior for PandasArray
Make sure all of core.algorithms and core.nanops work correctly on the above (xref compat: PandasArray not accepted by many pandas functions #25018)
Implement a wrapper for downstream EA subclasses that cant use the _wrap_data pattern.
(Note at this point, we haven't changed anything in core.internals)
Transition core.internals to expect block.values.shape to always match block.shape
Likely multiple follow-up PRs to clean out no-longer-needed workarounds related to 2D/1D workarounds
- Remove Block.ndim attribute and constructor arg
- Remove Block.mgr_locs attribute, just place it on the BlockManager
(Not necessary, but possibly desired) Make all blocks backed by EAs
- Have Series constructor use pd.array
(Not necessary, but possibly desired) Have arithmetic dispatch to underlying EAs

TomAugspurger · 2019-06-18T16:38:47Z

Thanks, I'll digest all that later.

One more high-level question I forgot: How do 3rd party EAs complicate things? Do we require that they can be reshaped to a (N, 1) and (1, N) 2D arrays as well? Can we hide that complexity from them, or is inevitable that they worry about it? If the goals are

All blocks in a DataFrame are 2D
Block.shape == Block.values.shape

then it seems inevitable that 3rd party EAs need to be able to handle this (limited) reshaping.

jbrockmendel · 2019-06-18T17:37:24Z

How do 3rd party EAs complicate things? Do we require that they can be reshaped to a (N, 1) and (1, N) 2D arrays as well?

I expect that most of these will be able to use ReshapeMixin (e.g. cyberpandas' NumPyBackedExtensionArrayMixin looks like it can use it directly). For others, we either require downstream authors to implement it or could provide another wrapper that just defines reshape ops and is backed by a 1D EA.

jorisvandenbossche · 2019-06-19T19:32:10Z

Keep in mind that nothing here is asking the EA methods to work for non-1D cases. I patched a couple of DTA methods here for experimental purposes, but in principal we wouldn't have to do even that.

But you said a bit higher you prefer that the EA.take would handle its 2D case itself?

Yes. The more of the array ops are implemented on the arrays, the less have to be implemented on the Blocks. My preference is that we do more than the bare minimum.

So in the "more than minimum" case, the EA methods need to work for 2D?
I have the feeling your are contradicting yourself, but maybe I am just misunderstanding. In your ideal case, does the EA.take method need to take care of the 2D ( (N,1), (1,N)) cases or no?

I am just trying to get an understanding what the implications are for EAs.

Under the "bare minimum" approach, only the Block operations where ndim mismatch and transposing are the constraining factors.

Can you give an example of such operations?

TomAugspurger · 2019-06-19T19:36:09Z

I read that as

We can do more than the minimum for our internal EAs
We can provide a wrapper that translates Block[ThirdPartyEA].take(2d-indexer) to the correct ThirdPartyEA.take(1d-indexer) call, and then boxes the output appropriately.

though I may be projecting my own thoughts onto this.

TomAugspurger · 2019-06-19T19:39:01Z

Which of our test EAs would make for a good test-case for the wrapper / mixin / metaclass for EAs that can't be reshaped 2D? I think either arrow or json would be best.

edit: Sorry for the extraneous comment. Not saying that the wrapper for un-reshapable EAs should necessarily be done here.

jorisvandenbossche · 2019-06-19T20:21:22Z

But if we would keep all the 1D special casing for external EAs, doesn't this defeat the purpose of the simplication?
Also, in general, I am not sure it is a good idea to deviate how internal and external EAs are handled.

jbrockmendel · 2019-06-19T21:11:48Z

I read that as [...]

Tom communicates better than Joris and I put together.

We could apply the bare-minimum requirements for general EAs and go past that for our own EAs.

Which of our test EAs would make for a good test-case for the wrapper / mixin / metaclass for EAs that can't be reshaped 2D? I think either arrow or json would be best.

I like this idea, will take a look at it.

In your ideal case, does the EA.take method need to take care of the 2D ( (N,1), (1,N)) cases or no?

In my ideal case, EAs are a drop-in replacement for ndarray. In my preferred compromise case, the EA would patch take to handle these cases. In the compromise case currently in this PR, only __getitem__ is non-trivially patched.

Also, in general, I am not sure it is a good idea to deviate how internal and external EAs are handled.

This elides the fact that we already do handle them differently, otherwise we wouldn't need DatetimeTZBlock, CategoricalBlock, or ObjectValuesExtensionBlock. It strikes me as very unlikely that 3rd party EAs will Just Work without analogous patching (I hypothesize some of the many xfailed/skipped tests in tests/extensions/ are related to this)

jorisvandenbossche · 2019-06-19T21:45:52Z

We could apply the bare-minimum requirements for general EAs and go past that for our own EAs.

Can you detail what those bare-minimum requirements are? Is that the __getitem__ you mention later on?

This elides the fact that we already do handle them differently, otherwise we wouldn't need DatetimeTZBlock, CategoricalBlock, or ObjectValuesExtensionBlock.

I don't think that's a good comparison in all cases. The ObjectValuesExtensionBlock is a very minimal tweak to have a different public .values attribute. So basically period and interval are, except for that 1 think, fully relying on the ExtensionBlock at the moment.

It strikes me as very unlikely that 3rd party EAs will Just Work without analogous patching (I hypothesize some of the many xfailed/skipped tests in tests/extensions/ are related to this)

What do you mean with this? As far as I know, everything indeed Just Works for the things that makes sense to work, without having the need to patch something from the ExtensionBlock.

Would it make sense to summarize it very shortly that it would move some of the complexity dealing with a 1D/2D mixture from the BlockManager to the ExtensionBlock or ExtensionArray ?

jbrockmendel · 2019-06-19T22:33:59Z

I don't think that's a good comparison in all cases. The ObjectValuesExtensionBlock is a very minimal tweak to have a different public .values attribute. So basically period and interval are, except for that 1 think, fully relying on the ExtensionBlock at the moment.

Right, but you're choosing to focus on the lease relevant of the three examples given.

What do you mean with this? As far as I know, everything indeed Just Works for the things that makes sense to work, without having the need to patch something from the ExtensionBlock.

Not sure how much clearer I can make it: if ExtensionBlock contained everything needed to make EAs work, then DatetimeTZBlock and CategoricalBlock would be unnecessary.

Would it make sense to summarize it very shortly that it would move some of the complexity dealing with a 1D/2D mixture from the BlockManager to the ExtensionBlock or ExtensionArray ?

Close. I'd say a) the complexity is moved from ExtensionBlock (and subclasses) to the EAs, b) that there is a net decrease in complexity to be gained, and c) that the cost of complexity is convex, so getting it out of the blocks would be a win even if there were no net gains.

jbrockmendel · 2019-06-19T22:41:02Z

As far as I know, everything indeed Just Works for the things that makes sense to work, without having the need to patch something from the ExtensionBlock.

from pandas.tests.extension.arrow.bool import *

# Reproduce the `data` fixture from tests.extension.arrow.test_bool
data = ArrowBoolArray.from_scalars(np.random.randint(0, 2, size=100,
                                                         dtype=bool))
ser = pd.Series(data)

>>> ser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/series.py", line 1415, in __repr__
    max_rows=max_rows, length=show_dimensions)
  File "pandas/core/series.py", line 1459, in to_string
    max_rows=max_rows)
  File "pandas/io/formats/format.py", line 178, in __init__
    self._chk_truncate()
  File "pandas/io/formats/format.py", line 192, in _chk_truncate
    series.iloc[-row_num:]))
  File "pandas/core/reshape/concat.py", line 229, in concat
    return op.get_result()
  File "pandas/core/reshape/concat.py", line 393, in get_result
    self.new_axes)
  File "pandas/core/internals/managers.py", line 1594, in concat
    new_block = blocks[0].concat_same_type(blocks)
  File "pandas/core/internals/blocks.py", line 1790, in concat_same_type
    [blk.values for blk in to_concat])
TypeError: _concat_same_type() missing 1 required positional argument: 'to_concat'

... unless you want to argue that __repr__ isn't among "the things that makes sense to work"...

jorisvandenbossche · 2019-06-20T06:33:56Z

... unless you want to argue that repr isn't among "the things that makes sense to work"...

Checking the skipped test, you can see it has a clear "TODO". It's missing a required classmethod of the interface -> #26957

Right, but you're choosing to focus on the lease relevant of the three examples given.

It's used by period, interval, and sparse is also using ExtensionBlock. For sure those are the lesser used types, not not irrelevant. They still have much more testing internally than our base extension tests provide.

And I have always assumed that the long term goal of CategoricalBlock would be to become an actual ExtensionBlock, if not for all historical reasons that it has some deviating behaviour we need to preserve (datetime block is something different as there you have the mix of consolidatable tz naive and non-consolidatable tz aware).

Not sure how much clearer I can make it: if ExtensionBlock contained everything needed to make EAs work, then DatetimeTZBlock and CategoricalBlock would be unnecessary.

What I meant with what I said is: for me, from my experience in GeoPandas, everything seems to work for now. And if something does not work out (which probably will come up once it is released), then I consider that a bug or missing feature of the interface and report / fix it in pandas. I once started with a custom GeometryBlock, but the EA interface was exactly introduced to not need meddling with the blocks.

Long story short: to me, it is not "unlikely that 3rd party EAs will just work without analogous patching [of the blocks]", but rather that should be our goal.
And it is in that regard that I was saying: I am not sure it is a good idea that we start treating internal and external EAs more differently. Because that (as an external EA author) does not give me more confidence that external EAs will work as smoothly as our internal ones.

Idea: we could also keep our internal EAs (or part of them) 1D. If we want to give this possibility to external EAs, we need the 1D/2D compatibility code in ExtensionBlock anyway, so why not use it ourselves?

jbrockmendel · 2019-06-20T15:17:38Z

Checking the skipped test, you can see it has a clear "TODO".

I retract that example.

And it is in that regard that I was saying: I am not sure it is a good idea that we start treating internal and external EAs more differently. Because that (as an external EA author) does not give me more confidence that external EAs will work as smoothly as our internal ones.

Thank you for clarifying, this makes much more sense to me.

jbrockmendel · 2019-06-21T03:32:19Z

Updated to make changes less invasive where possible.

If my guess in #26976 is correct, then we can have DatetimeTZBlock use Block.take_nd instead of ExtensionBlock.take_nd`.

jbrockmendel · 2019-06-21T17:18:08Z

Did we reach anything resembling consensus on how to move forward here on the call? cc @jreback

jreback · 2019-06-21T18:00:10Z

@jbrockmendel my expectations should be that

internal & external EA's wont' actually have to change anything that is currently public, they will just simply work
we can / should use private methods in the arrays.base to facilitation this, e.g. _take2d, _reshape2d (or maybe _reshape_compat and the like)
we should change our internal block handling code to use these methods

this IMHO is much less magical than a decorator, its conceptually like the mixin you have proposed but less invasive.

I think that all of your aims can be achieved with this design (simplification and unification of block handling code for 1d/2d)

It maybe that we need to augement for example Categorical and/ro Datetime w/tz with methods / overrides in order to lose the actual Categorical / DatetimeTZ blocks; this IMHO should show us some gaps (possibly in the public part of the EA interface, but more likely in the 'private' part, meaning _take2d and friends).

TomAugspurger · 2019-06-21T18:07:17Z

One issue with the private methods like _take2d. That doesn't solve a root problem like Block.shape != Block.values.shape. If we're shifting more things from Block to Array that may not be such a big deal, but it's worth noting.

In terms of progressing, I'm having trouble reconciling the changes here with #26954. Personally, I'd like to see something like #26954 (with a class decorator being my desired approach) go in first, and then see what remains to be done here. But again, I'm having trouble seeing how these two should come together.

jreback · 2019-06-21T18:17:36Z

Would it be helpful to do this on a branch so see how things play out? certainly amenable to trying approaches to see what works

jbrockmendel · 2019-06-21T18:20:46Z

Personally, I'd like to see something like #26954 (with a class decorator being my desired approach) go in first, and then see what remains to be done here.

Right now I'm working on extending the proof of concept in #26954 to have Categorical use it to become 2D, will see what that breaks.

I'll also update it to use the suggested non-__init__-overriding implementation.

One issue with the private methods like _take2d. That doesn't solve a root problem like Block.shape != Block.values.shape

@jreback Tom is right, having block.shape == block.values.shape is a big part of where the net simplification comes from. We can offer downstream authors ways to minimize the effect on them (largely because we only need collike/rowlike and not "full" 2D support), but getting to literally-zero isn't likely to work.

Would it be helpful to do this on a branch so see how things play out? certainly amenable to trying approaches to see what works

I'll adapt #26954 into that.

TomAugspurger · 2019-06-21T18:24:57Z

Just FYI, I was playing with a class decorator the last 20 minutes and it's not 100% straightforward. I forgot when class decorators are run, so they only apply to the base class and not to subclasses. I think we'd need a metaclass.

jreback · 2019-06-21T18:27:32Z

block.shape == block.values.shape

and why can't we add a base class attribute (to EA base) so that we have (also renaming)
block._shape == block.values._shape

jbrockmendel added 23 commits June 13, 2019 18:07

Implement ReshapeMixin to allow datetimelike arrays to be 2D

bb1ea4a

Make DatetimeTZBlock.values have same ndim as the block

65dc544

Minimal follow-on edits to keep tests passing

checkpoint with only 18 failing tests

ccc49b4

Checkpoint with 13 failing

92a6ec7

passing

923cd7d

Merge branch 'master' of https://github.com/pandas-dev/pandas into d2arr

ab05bcd

passing

7fc8021

cleanup

03817c4

nothing broken

929d87b

cleanup

6f8eae3

Merge branch 'master' of https://github.com/pandas-dev/pandas into d2arr

729abbb

remove assertion

401a35e

remove unreachable

1d6b812

docstrings, remove need for swapaxes

2fe0c48

Merge branch 'master' of https://github.com/pandas-dev/pandas into d2arr

9f525fc

separate tests, use Block.interpolate

dd42b06

Merge branch 'master' of https://github.com/pandas-dev/pandas into d2arr

88fbd6a

repr tests, _unstack

089af0a

cleanup

84ebc36

cleanup

f789d02

cleanup

b37e347

Cleanup

a4b8505

fix sql tests

99931d4

Merge branch 'master' of https://github.com/pandas-dev/pandas into d2arr

3ff3380

jbrockmendel mentioned this pull request Jun 20, 2019

POC: 2D support for 1D ExtensionArray subclasses #26954

Closed

Merge branch 'master' of https://github.com/pandas-dev/pandas into d2arr

bbd098a

jorisvandenbossche mentioned this pull request Jun 20, 2019

TST: fix class method of test BoolArray #26957

Merged

Merge branch 'master' of https://github.com/pandas-dev/pandas into d2arr

ce9b464

jbrockmendel mentioned this pull request Jun 21, 2019

algos.take calling validate_indices incorrectly? #26976

Closed

jbrockmendel added 2 commits June 20, 2019 19:40

simplifications

9979c60

Merge branch 'master' of https://github.com/pandas-dev/pandas into d2arr

06c4544

gfyoung added Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. labels Jun 26, 2019

jbrockmendel closed this Jul 1, 2019

jbrockmendel mentioned this pull request Jul 3, 2019

EA: support basic 2D operations #27142

Closed

jbrockmendel deleted the d2arr branch November 20, 2021 23:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP/POC: Allow (N, 1) and (1,N) EAs #26914

WIP/POC: Allow (N, 1) and (1,N) EAs #26914

jbrockmendel commented Jun 18, 2019

codecov bot commented Jun 18, 2019

codecov bot commented Jun 18, 2019 •

edited

Loading

TomAugspurger commented Jun 18, 2019

jbrockmendel commented Jun 18, 2019

TomAugspurger commented Jun 18, 2019

jbrockmendel commented Jun 18, 2019

jorisvandenbossche commented Jun 19, 2019

TomAugspurger commented Jun 19, 2019 •

edited

Loading

TomAugspurger commented Jun 19, 2019 •

edited

Loading

jorisvandenbossche commented Jun 19, 2019

jbrockmendel commented Jun 19, 2019

jorisvandenbossche commented Jun 19, 2019

jbrockmendel commented Jun 19, 2019

jbrockmendel commented Jun 19, 2019

jorisvandenbossche commented Jun 20, 2019 •

edited

Loading

jbrockmendel commented Jun 20, 2019

jbrockmendel commented Jun 21, 2019

jbrockmendel commented Jun 21, 2019

jreback commented Jun 21, 2019

TomAugspurger commented Jun 21, 2019

jreback commented Jun 21, 2019

jbrockmendel commented Jun 21, 2019

TomAugspurger commented Jun 21, 2019

jreback commented Jun 21, 2019

WIP/POC: Allow (N, 1) and (1,N) EAs #26914

WIP/POC: Allow (N, 1) and (1,N) EAs #26914

Conversation

jbrockmendel commented Jun 18, 2019

codecov bot commented Jun 18, 2019

Codecov Report

codecov bot commented Jun 18, 2019 • edited Loading

Codecov Report

TomAugspurger commented Jun 18, 2019

jbrockmendel commented Jun 18, 2019

TomAugspurger commented Jun 18, 2019

jbrockmendel commented Jun 18, 2019

jorisvandenbossche commented Jun 19, 2019

TomAugspurger commented Jun 19, 2019 • edited Loading

TomAugspurger commented Jun 19, 2019 • edited Loading

jorisvandenbossche commented Jun 19, 2019

jbrockmendel commented Jun 19, 2019

jorisvandenbossche commented Jun 19, 2019

jbrockmendel commented Jun 19, 2019

jbrockmendel commented Jun 19, 2019

jorisvandenbossche commented Jun 20, 2019 • edited Loading

jbrockmendel commented Jun 20, 2019

jbrockmendel commented Jun 21, 2019

jbrockmendel commented Jun 21, 2019

jreback commented Jun 21, 2019

TomAugspurger commented Jun 21, 2019

jreback commented Jun 21, 2019

jbrockmendel commented Jun 21, 2019

TomAugspurger commented Jun 21, 2019

jreback commented Jun 21, 2019

codecov bot commented Jun 18, 2019 •

edited

Loading

TomAugspurger commented Jun 19, 2019 •

edited

Loading

TomAugspurger commented Jun 19, 2019 •

edited

Loading

jorisvandenbossche commented Jun 20, 2019 •

edited

Loading