Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP/POC: Allow (N, 1) and (1,N) EAs #26914

Closed
wants to merge 30 commits into from

Conversation

jbrockmendel
Copy link
Member

Discussed on the mailing list last week, cc @TomAugspurger @jreback @jorisvandenbossche

The central claim: if we allow EAs to take on shapes (N, 1) and (1, N), then we can get some major simplifications in core.internals.

For the proof of concept I've made the Datetimelike arrays accept these shapes and updated DatetimeTZBlock to see what simplifications are available. The _wrap_data implementation would also work for Categorical and PandasArray.

Three categories of simplifications are not in here yet and you'll have to use your imaginations

  • some can only be made once all EAs allow these shapes, e.g. not having to pass ndim arg to Block constructors
  • some would require additional tweaks on the EA subclasses, e.g. making shift, __iter__, or take aware of 2Dness
  • some I just haven't got to yet. core.internals is rough.

Added bonus: this fleshes out bugs (see #26864)

@codecov
Copy link

codecov bot commented Jun 18, 2019

Codecov Report

Merging #26914 into master will decrease coverage by 0.01%.
The diff coverage is 92.56%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26914      +/-   ##
==========================================
- Coverage   91.87%   91.85%   -0.02%     
==========================================
  Files         180      180              
  Lines       50712    50792      +80     
==========================================
+ Hits        46590    46657      +67     
- Misses       4122     4135      +13
Flag Coverage Δ
#multiple 90.45% <92.56%> (-0.01%) ⬇️
#single 41.12% <50.41%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/datetimelike.py 97.95% <100%> (+0.02%) ⬆️
pandas/core/arrays/timedeltas.py 88.82% <100%> (ø) ⬆️
pandas/core/arrays/datetimes.py 97.79% <100%> (ø) ⬆️
pandas/core/groupby/ops.py 96% <100%> (ø) ⬆️
pandas/core/dtypes/concat.py 96.6% <100%> (+0.01%) ⬆️
pandas/core/internals/managers.py 95.21% <100%> (ø) ⬆️
pandas/core/internals/concat.py 96.51% <100%> (+0.02%) ⬆️
pandas/io/formats/format.py 97.8% <83.33%> (-0.11%) ⬇️
pandas/core/internals/blocks.py 94.25% <90.9%> (-0.13%) ⬇️
pandas/core/arrays/base.py 98.52% <92%> (-0.92%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update baa77c3...99931d4. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 18, 2019

Codecov Report

Merging #26914 into master will decrease coverage by <.01%.
The diff coverage is 94.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26914      +/-   ##
==========================================
- Coverage   91.97%   91.97%   -0.01%     
==========================================
  Files         180      180              
  Lines       50756    50833      +77     
==========================================
+ Hits        46685    46753      +68     
- Misses       4071     4080       +9
Flag Coverage Δ
#multiple 90.57% <94.95%> (ø) ⬆️
#single 41.84% <51.26%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/concat.py 96.58% <ø> (ø) ⬆️
pandas/core/arrays/datetimelike.py 97.95% <100%> (+0.03%) ⬆️
pandas/core/arrays/timedeltas.py 88.82% <100%> (ø) ⬆️
pandas/core/arrays/datetimes.py 97.81% <100%> (+0.01%) ⬆️
pandas/core/groupby/ops.py 96% <100%> (ø) ⬆️
pandas/core/internals/managers.py 96% <100%> (ø) ⬆️
pandas/core/internals/concat.py 96.51% <100%> (+0.02%) ⬆️
pandas/io/formats/format.py 97.8% <83.33%> (-0.11%) ⬇️
pandas/core/arrays/base.py 98.52% <92%> (-0.92%) ⬇️
pandas/core/internals/blocks.py 94.42% <94.91%> (+0.04%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b9b081d...06c4544. Read the comment docs.

@TomAugspurger
Copy link
Contributor

What have you learned from starting the proof of concept? Do you think this is worth pursing?

@jbrockmendel
Copy link
Member Author

What have you learned from starting the proof of concept?

Good question. In no particular order from my notes:

  • Implementing ReshapeMixin is straighforward
  • Patching any given EA method for the 2D cases is easy, but there are a bunch of them.
  • There are a lot of workarounds for the 1D/2D mix scattered through the code, not just in core.internals. This both means that a) it will be some work to track them all down and b) the simplification we get will not be limited to core.internals.
  • groupby and reshape code both reach into internals more than I expected/hoped.
  • Many try/except blocks in core.internals wrap a lot in the try so it isn't clear what failure modes are being caught.
  • We probably don't need both Block.astype and Block._astype, since the former just calls the latter.
  • indexing._NDFrameIndexer._setitem_with_indexer needs to be refactored into smaller chunks
  • super().foo doesn't work in the interpreter the same way super(whatever, self).foo did

Do you think this is worth pursing?

Yes. As described above, I think the complexity of internals is hiding a good amount of unintended behavior that this would help alleviate.

I'd like to find a nice way to break this into smaller pieces. Something like:

  • Implement+test all the 2D behavior for DTA/TDA/PA
  • Implement+test all the 2D behavior for Categorical
  • Implement+test all the 2D behavior for SparseArray (xref BUG: DataFrame[Sparse] quantile fails because SparseArray has no reshape  #24600)
  • Implement+test all the 2D behavior for PandasArray
  • Make sure all of core.algorithms and core.nanops work correctly on the above (xref compat: PandasArray not accepted by many pandas functions #25018)
  • Implement a wrapper for downstream EA subclasses that cant use the _wrap_data pattern.
    (Note at this point, we haven't changed anything in core.internals)
  • Transition core.internals to expect block.values.shape to always match block.shape
  • Likely multiple follow-up PRs to clean out no-longer-needed workarounds related to 2D/1D workarounds
    • Remove Block.ndim attribute and constructor arg
    • Remove Block.mgr_locs attribute, just place it on the BlockManager
  • (Not necessary, but possibly desired) Make all blocks backed by EAs
    • Have Series constructor use pd.array
  • (Not necessary, but possibly desired) Have arithmetic dispatch to underlying EAs

@TomAugspurger
Copy link
Contributor

Thanks, I'll digest all that later.

One more high-level question I forgot: How do 3rd party EAs complicate things? Do we require that they can be reshaped to a (N, 1) and (1, N) 2D arrays as well? Can we hide that complexity from them, or is inevitable that they worry about it? If the goals are

  1. All blocks in a DataFrame are 2D
  2. Block.shape == Block.values.shape

then it seems inevitable that 3rd party EAs need to be able to handle this (limited) reshaping.

@jbrockmendel
Copy link
Member Author

How do 3rd party EAs complicate things? Do we require that they can be reshaped to a (N, 1) and (1, N) 2D arrays as well?

I expect that most of these will be able to use ReshapeMixin (e.g. cyberpandas' NumPyBackedExtensionArrayMixin looks like it can use it directly). For others, we either require downstream authors to implement it or could provide another wrapper that just defines reshape ops and is backed by a 1D EA.

@jorisvandenbossche
Copy link
Member

Keep in mind that nothing here is asking the EA methods to work for non-1D cases. I patched a couple of DTA methods here for experimental purposes, but in principal we wouldn't have to do even that.

But you said a bit higher you prefer that the EA.take would handle its 2D case itself?

Yes. The more of the array ops are implemented on the arrays, the less have to be implemented on the Blocks. My preference is that we do more than the bare minimum.

So in the "more than minimum" case, the EA methods need to work for 2D?
I have the feeling your are contradicting yourself, but maybe I am just misunderstanding. In your ideal case, does the EA.take method need to take care of the 2D ( (N,1), (1,N)) cases or no?

I am just trying to get an understanding what the implications are for EAs.

Under the "bare minimum" approach, only the Block operations where ndim mismatch and transposing are the constraining factors.

Can you give an example of such operations?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 19, 2019

I read that as

  1. We can do more than the minimum for our internal EAs
  2. We can provide a wrapper that translates Block[ThirdPartyEA].take(2d-indexer) to the correct ThirdPartyEA.take(1d-indexer) call, and then boxes the output appropriately.

though I may be projecting my own thoughts onto this.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 19, 2019

Which of our test EAs would make for a good test-case for the wrapper / mixin / metaclass for EAs that can't be reshaped 2D? I think either arrow or json would be best.

edit: Sorry for the extraneous comment. Not saying that the wrapper for un-reshapable EAs should necessarily be done here.

@jorisvandenbossche
Copy link
Member

But if we would keep all the 1D special casing for external EAs, doesn't this defeat the purpose of the simplication?
Also, in general, I am not sure it is a good idea to deviate how internal and external EAs are handled.

@jbrockmendel
Copy link
Member Author

I read that as [...]

Tom communicates better than Joris and I put together.

We could apply the bare-minimum requirements for general EAs and go past that for our own EAs.

Which of our test EAs would make for a good test-case for the wrapper / mixin / metaclass for EAs that can't be reshaped 2D? I think either arrow or json would be best.

I like this idea, will take a look at it.

In your ideal case, does the EA.take method need to take care of the 2D ( (N,1), (1,N)) cases or no?

In my ideal case, EAs are a drop-in replacement for ndarray. In my preferred compromise case, the EA would patch take to handle these cases. In the compromise case currently in this PR, only __getitem__ is non-trivially patched.

Also, in general, I am not sure it is a good idea to deviate how internal and external EAs are handled.

This elides the fact that we already do handle them differently, otherwise we wouldn't need DatetimeTZBlock, CategoricalBlock, or ObjectValuesExtensionBlock. It strikes me as very unlikely that 3rd party EAs will Just Work without analogous patching (I hypothesize some of the many xfailed/skipped tests in tests/extensions/ are related to this)

@jorisvandenbossche
Copy link
Member

We could apply the bare-minimum requirements for general EAs and go past that for our own EAs.

Can you detail what those bare-minimum requirements are? Is that the __getitem__ you mention later on?

This elides the fact that we already do handle them differently, otherwise we wouldn't need DatetimeTZBlock, CategoricalBlock, or ObjectValuesExtensionBlock.

I don't think that's a good comparison in all cases. The ObjectValuesExtensionBlock is a very minimal tweak to have a different public .values attribute. So basically period and interval are, except for that 1 think, fully relying on the ExtensionBlock at the moment.

It strikes me as very unlikely that 3rd party EAs will Just Work without analogous patching (I hypothesize some of the many xfailed/skipped tests in tests/extensions/ are related to this)

What do you mean with this? As far as I know, everything indeed Just Works for the things that makes sense to work, without having the need to patch something from the ExtensionBlock.


Would it make sense to summarize it very shortly that it would move some of the complexity dealing with a 1D/2D mixture from the BlockManager to the ExtensionBlock or ExtensionArray ?

@jbrockmendel
Copy link
Member Author

I don't think that's a good comparison in all cases. The ObjectValuesExtensionBlock is a very minimal tweak to have a different public .values attribute. So basically period and interval are, except for that 1 think, fully relying on the ExtensionBlock at the moment.

Right, but you're choosing to focus on the lease relevant of the three examples given.

What do you mean with this? As far as I know, everything indeed Just Works for the things that makes sense to work, without having the need to patch something from the ExtensionBlock.

Not sure how much clearer I can make it: if ExtensionBlock contained everything needed to make EAs work, then DatetimeTZBlock and CategoricalBlock would be unnecessary.

Would it make sense to summarize it very shortly that it would move some of the complexity dealing with a 1D/2D mixture from the BlockManager to the ExtensionBlock or ExtensionArray ?

Close. I'd say a) the complexity is moved from ExtensionBlock (and subclasses) to the EAs, b) that there is a net decrease in complexity to be gained, and c) that the cost of complexity is convex, so getting it out of the blocks would be a win even if there were no net gains.

@jbrockmendel
Copy link
Member Author

As far as I know, everything indeed Just Works for the things that makes sense to work, without having the need to patch something from the ExtensionBlock.

from pandas.tests.extension.arrow.bool import *

# Reproduce the `data` fixture from tests.extension.arrow.test_bool
data = ArrowBoolArray.from_scalars(np.random.randint(0, 2, size=100,
                                                         dtype=bool))
ser = pd.Series(data)

>>> ser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/series.py", line 1415, in __repr__
    max_rows=max_rows, length=show_dimensions)
  File "pandas/core/series.py", line 1459, in to_string
    max_rows=max_rows)
  File "pandas/io/formats/format.py", line 178, in __init__
    self._chk_truncate()
  File "pandas/io/formats/format.py", line 192, in _chk_truncate
    series.iloc[-row_num:]))
  File "pandas/core/reshape/concat.py", line 229, in concat
    return op.get_result()
  File "pandas/core/reshape/concat.py", line 393, in get_result
    self.new_axes)
  File "pandas/core/internals/managers.py", line 1594, in concat
    new_block = blocks[0].concat_same_type(blocks)
  File "pandas/core/internals/blocks.py", line 1790, in concat_same_type
    [blk.values for blk in to_concat])
TypeError: _concat_same_type() missing 1 required positional argument: 'to_concat'

... unless you want to argue that __repr__ isn't among "the things that makes sense to work"...

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 20, 2019

... unless you want to argue that repr isn't among "the things that makes sense to work"...

Checking the skipped test, you can see it has a clear "TODO". It's missing a required classmethod of the interface -> #26957

Right, but you're choosing to focus on the lease relevant of the three examples given.

It's used by period, interval, and sparse is also using ExtensionBlock. For sure those are the lesser used types, not not irrelevant. They still have much more testing internally than our base extension tests provide.

And I have always assumed that the long term goal of CategoricalBlock would be to become an actual ExtensionBlock, if not for all historical reasons that it has some deviating behaviour we need to preserve (datetime block is something different as there you have the mix of consolidatable tz naive and non-consolidatable tz aware).

Not sure how much clearer I can make it: if ExtensionBlock contained everything needed to make EAs work, then DatetimeTZBlock and CategoricalBlock would be unnecessary.

What I meant with what I said is: for me, from my experience in GeoPandas, everything seems to work for now. And if something does not work out (which probably will come up once it is released), then I consider that a bug or missing feature of the interface and report / fix it in pandas. I once started with a custom GeometryBlock, but the EA interface was exactly introduced to not need meddling with the blocks.

Long story short: to me, it is not "unlikely that 3rd party EAs will just work without analogous patching [of the blocks]", but rather that should be our goal.
And it is in that regard that I was saying: I am not sure it is a good idea that we start treating internal and external EAs more differently. Because that (as an external EA author) does not give me more confidence that external EAs will work as smoothly as our internal ones.


Idea: we could also keep our internal EAs (or part of them) 1D. If we want to give this possibility to external EAs, we need the 1D/2D compatibility code in ExtensionBlock anyway, so why not use it ourselves?

@jbrockmendel
Copy link
Member Author

Checking the skipped test, you can see it has a clear "TODO".

I retract that example.

And it is in that regard that I was saying: I am not sure it is a good idea that we start treating internal and external EAs more differently. Because that (as an external EA author) does not give me more confidence that external EAs will work as smoothly as our internal ones.

Thank you for clarifying, this makes much more sense to me.

@jbrockmendel
Copy link
Member Author

Updated to make changes less invasive where possible.

If my guess in #26976 is correct, then we can have DatetimeTZBlock use Block.take_nd instead of ExtensionBlock.take_nd`.

@jbrockmendel
Copy link
Member Author

Did we reach anything resembling consensus on how to move forward here on the call? cc @jreback

@jreback
Copy link
Contributor

jreback commented Jun 21, 2019

@jbrockmendel my expectations should be that

  • internal & external EA's wont' actually have to change anything that is currently public, they will just simply work
  • we can / should use private methods in the arrays.base to facilitation this, e.g. _take2d, _reshape2d (or maybe _reshape_compat and the like)
  • we should change our internal block handling code to use these methods

this IMHO is much less magical than a decorator, its conceptually like the mixin you have proposed but less invasive.

I think that all of your aims can be achieved with this design (simplification and unification of block handling code for 1d/2d)

It maybe that we need to augement for example Categorical and/ro Datetime w/tz with methods / overrides in order to lose the actual Categorical / DatetimeTZ blocks; this IMHO should show us some gaps (possibly in the public part of the EA interface, but more likely in the 'private' part, meaning _take2d and friends).

@TomAugspurger
Copy link
Contributor

One issue with the private methods like _take2d. That doesn't solve a root problem like Block.shape != Block.values.shape. If we're shifting more things from Block to Array that may not be such a big deal, but it's worth noting.


In terms of progressing, I'm having trouble reconciling the changes here with #26954. Personally, I'd like to see something like #26954 (with a class decorator being my desired approach) go in first, and then see what remains to be done here. But again, I'm having trouble seeing how these two should come together.

@jreback
Copy link
Contributor

jreback commented Jun 21, 2019

Would it be helpful to do this on a branch so see how things play out? certainly amenable to trying approaches to see what works

@jbrockmendel
Copy link
Member Author

Personally, I'd like to see something like #26954 (with a class decorator being my desired approach) go in first, and then see what remains to be done here.

Right now I'm working on extending the proof of concept in #26954 to have Categorical use it to become 2D, will see what that breaks.

I'll also update it to use the suggested non-__init__-overriding implementation.

One issue with the private methods like _take2d. That doesn't solve a root problem like Block.shape != Block.values.shape

@jreback Tom is right, having block.shape == block.values.shape is a big part of where the net simplification comes from. We can offer downstream authors ways to minimize the effect on them (largely because we only need collike/rowlike and not "full" 2D support), but getting to literally-zero isn't likely to work.

Would it be helpful to do this on a branch so see how things play out? certainly amenable to trying approaches to see what works

I'll adapt #26954 into that.

@TomAugspurger
Copy link
Contributor

Just FYI, I was playing with a class decorator the last 20 minutes and it's not 100% straightforward. I forgot when class decorators are run, so they only apply to the base class and not to subclasses. I think we'd need a metaclass.

@jreback
Copy link
Contributor

jreback commented Jun 21, 2019

block.shape == block.values.shape

and why can't we add a base class attribute (to EA base) so that we have (also renaming)
block._shape == block.values._shape

@gfyoung gfyoung added Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. labels Jun 26, 2019
@jbrockmendel jbrockmendel deleted the d2arr branch November 20, 2021 23:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants