Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: back IntervalArray by a single ndarray #37047

Merged
merged 9 commits into from
Oct 12, 2020

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

The main motivation here is for perf for methods that current cast to object dtype, including value_counts, set ops, values_for_(argsort|factorize)

Also we get an actually-simple simple_new

pandas/core/arrays/interval.py Show resolved Hide resolved
@@ -1399,3 +1344,78 @@ def maybe_convert_platform_interval(values):
values = np.asarray(values)

return maybe_convert_platform(values)


def _maybe_cast_inputs(left, right, copy, dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type if you can (esp the return)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

return left, right


def _get_combined_data(left, right):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type if you can esp the return

@jreback jreback added Interval Interval data type Refactor Internal refactoring of code labels Oct 11, 2020


def _get_combined_data(
left: Union["Index", ArrayLike], right: Union["Index", ArrayLike]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this method be more strict? e.g. only accept Union[np.ndarray, "DatetimeArray", "TimedeltaArray"]

e.g. things that have already been casted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can do quite a bit less back-and-forth casting eventually, yes

@jreback jreback added this to the 1.2 milestone Oct 12, 2020
@jreback
Copy link
Contributor

jreback commented Oct 12, 2020

cc @jschendel @TomAugspurger if you want to look

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 12, 2020 via email

@jreback jreback merged commit 9cb3723 into pandas-dev:master Oct 12, 2020
@jreback
Copy link
Contributor

jreback commented Oct 12, 2020

thanks @jbrockmendel

@jbrockmendel jbrockmendel deleted the ref-ia-ii-3 branch October 13, 2020 00:44
@jorisvandenbossche
Copy link
Member

The consequence of this is that we can never use another ExtensionArray as the backer of an IntervalArray, right? (except for the DatetimeArray for which this is an internal special case already)
I am not sure we want to make this choice, even though right now we don't have any other EA-backed intervals?

Also, @jbrockmendel you mention this change is mainly for perf reasons. What is the amount of speed-up that this gives? Can you give a few examples? (are there benchmarks for this?)

@jreback
Copy link
Contributor

jreback commented Oct 13, 2020

The consequence of this is that we can never use another ExtensionArray as the backer of an IntervalArray, right? (except for the DatetimeArray for which this is an internal special case already)

I am not sure we want to make this choice, even though right now we don't have any other EA-backed intervals?

Also, @jbrockmendel you mention this change is mainly for perf reasons. What is the amount of speed-up that this gives? Can you give a few examples? (are there benchmarks for this?)

huh why make things more complicated ? what reason do you have for not dojng this?
we want simpler not more complicated - it's good to be opionates

@jorisvandenbossche
Copy link
Member

I think the model we have had for years (storing a left and right array) is actually quite simple, and maps to the mental model of an IntervalArray as a struct array.
(in the end, storing 2 1D array or a single 2D arrays is really not that much of a difference conceptually)

(and as a sidenote @jreback, some feedback on how we are discussing things: words like "huh" (which give the impression to me as if what I said is complete nonsense) is not really helpfull for a good discussion)

@jreback
Copy link
Contributor

jreback commented Oct 13, 2020

i think huh is pretty telling actually - you are making a late comment because on a PR because
of what? you think that the original implementation is good? i don't understand anything from your original comment

so i don't this is helpful at all and the huh is confusion because of the commen

@jorisvandenbossche
Copy link
Member

(then you can also say: "I am confused" instead of "huh?". Maybe for you it is the same, but so I am giving you the feedback that this phrasing is giving me a very negative feeling. Even if you are confused why I am commenting on a closed PR, I think you can say that in a different way)

Yes, this PR is already merged, but since when does that mean I cannot have comments anymore about the change? (the PR was also open for not even 2 days, without a prior issue discussing this. That's not leaving much time to comment on it while it was open)


But so about the actual topic:

what reason do you have for not doing this?

To repeat what I said in my original comment: this means "we can never use another ExtensionArray as the backer of an IntervalArray".

Of course, right now, we only use numpy arrays as the underlying values for IntervalArray (also because we don't yet have an EA-backed Index). So there is not a direct issue at this moment with this change.

But I personally think it is reasonable to consider allowing IntervalArray to be backed by both ndarray or EA.
Actually, if we would remove the coercing to object dtype for EAs / dtype validation in _simple_new, I think a large part of the IntervalArray implementation would actually right now already work with EAs (IntervalIndex with the tree engine is something else of course, but indexing engines with EAs is a general topic we still need to solve).

And why would we want to have IntervalArray backed by EAs?
First, because we have the extension mechanism of EAs, and supporting extension dtypes in IntervalArray as well could be seen as a way to further complete the EA support throughout pandas.
More specifically: what are we going to do with our own numeric EAs (integer dtype, floating dtype, etc)? Right now those are not yet integrated with IntervalArray, but at some point we will want to be able to create an IntervalArray from a nullable integer column, that preserves the nullable dtype (eg when accessing left/right, or mid). There are of course other ways to implement this, but I think as a default storing the EA as left/right array is the easiest solution.

@TomAugspurger
Copy link
Contributor

Ah that's a fair point about other extension types I hadn't considered.

IIUC, the main reason for this PR was to clean up the (hacky) code to support IntervalArray.__setitem__. I thought that there was more discussion around performance, but maybe I'm thinking of another issue. I guess I was thinking of #36310.

@jreback
Copy link
Contributor

jreback commented Oct 13, 2020

I think this is a better representation actually of the primary usecases of IntervalArrays, until / unless we have another one this is one that keeps consistency of dtypes which has plagued since the beginning. This advocates for 2E EA (@jbrockmendel will be happy).

THe main point though is the mental model is much simpler. This is a collection of Interval points that are joined exactly; 2 independent arrays are not, you have lots of checking are they they same dtype, are they the same length and so on. this is simpler.

Let's not build more than we need now.

@jorisvandenbossche
Copy link
Member

this is one that keeps consistency of dtypes which has plagued since the beginning.

Not fully sure what you mean here. Can you give an example of a past issue releted to this?

THe main point though is the mental model is much simpler. This is a collection of Interval points that are joined exactly; 2 independent arrays are not, you have lots of checking are they they same dtype, are they the same length and so on. this is simpler.

Well, the current code to deal with that was mostly focused in a single function, i.e. _simple_new. And all that code is still needed after this PR (only moved to a new helper function), because we still allow to create an IntervalArray from two arrays (which needs all this checking upon construction).

Whether a single 2D array is "much simpler" is clearly subjective then, but anyway, my main worry this related to EA support:

Let's not build more than we need now.

We actually need it now. For sure, there is not an open PR or issue about supporting EAs in IntervalArray right now. But we are still actively working on making the nullable dtypes full replacements for their numpy counterparts. Support in IntervalArray is for me part of that "full support".
So yes, I think we should think about how this can fit in with this refactor now, before doing any further work on this code.

@jbrockmendel
Copy link
Member Author

you mention this change is mainly for perf reasons. What is the amount of speed-up that this gives? Can you give a few examples? (are there benchmarks for this?)

Not on hand, no.

But we are still actively working on making the nullable dtypes full replacements for their numpy counterparts

full replacements for numpy counterparts would support 2D.

@jorisvandenbossche
Copy link
Member

full replacements for numpy counterparts would support 2D.

Not necessarily for pandas functionality, though

But that's not the actual discussion here. To repeat again: I think we should be able to have an IntervalArray using EA (eg to support IntervalArray with nullable int/float). How do we do that? (on the short term)

And yes, 2D EAs would be one way. But that's a much bigger discussion, though. So if someone wants to have that discussion, I would suggest to open a separate issue for that.
But I think storing left/right as 1D arrays (as we have always done up to now) is the much easier solution on the short term.

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Oct 26, 2020
@jorisvandenbossche
Copy link
Member

What the status of this discussion?
If I recall correctly, on the dev meeting two weeks ago it was said that this might enable performance benefits (and that @jbrockmendel you would try an example perf improvement PR?), and that otherwise there would be no objection to reverting it (is that correct?)

Since I haven't seen further attempts to use this for performance improvements, shall we revert it?

@jbrockmendel
Copy link
Member Author

you would try an example perf improvement PR

thanks for the reminder; ill move this up the priority list

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
@jorisvandenbossche
Copy link
Member

Any update here? Otherwise I would propose to revert this change.

@jbrockmendel
Copy link
Member Author

I've got a branch going on the set ops, but it has devolved into yak shaving. There is code simplification available unrelated to perf, but on the call I agreed not to move forward with any of that in the short-term.

@jorisvandenbossche
Copy link
Member

Another ping here. If there are no plans for actual improvements based on this change, are you OK with reverting it?

@jbrockmendel
Copy link
Member Author

Do code simplifications count as improvements?

The stated reason for wanting to revert is to allow IntegerArray/FloatArray to back it. Are there plans to do that in the near future?

@jorisvandenbossche
Copy link
Member

Do code simplifications count as improvements?

In principle certainly yes, but in this case I think it is quite subjective. As I said above, I personally don't find this much of a simplification. But it is clear that there is disagreement about this ;)

The stated reason for wanting to revert is to allow IntegerArray/FloatArray to back it. Are there plans to do that in the near future?

As mentioned above: there is no concrete open issue or PR for this, but we are actively working towards full support for the nullable dtypes so they can become the default in a future release. While working on this the coming months, supporting them in IntervalArray will be one of the issues we need to tackle. So yes, there are IMO plans to do this in the near future.

@jreback
Copy link
Contributor

jreback commented Nov 18, 2020

was wavering in this but i agree with @jorisvandenbossche logic here for reverting

@jbrockmendel
Copy link
Member Author

OK. easy to un-revert once either a) i finish the yak shaving or b) nullable dtypes support 2D.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Interval Interval data type Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants