REF: back IntervalArray by a single ndarray #37047

jbrockmendel · 2020-10-11T02:43:36Z

The main motivation here is for perf for methods that current cast to object dtype, including value_counts, set ops, values_for_(argsort|factorize)

Also we get an actually-simple simple_new

…f-ia-ii-3

pandas/core/arrays/interval.py

jreback · 2020-10-11T16:31:12Z

pandas/core/arrays/interval.py

@@ -1399,3 +1344,78 @@ def maybe_convert_platform_interval(values):
        values = np.asarray(values)

    return maybe_convert_platform(values)
+
+
+def _maybe_cast_inputs(left, right, copy, dtype):


type if you can (esp the return)

jreback · 2020-10-11T16:31:28Z

pandas/core/arrays/interval.py

+    return left, right
+
+
+def _get_combined_data(left, right):


type if you can esp the return

…f-ia-ii-3

jreback · 2020-10-11T17:58:53Z

pandas/core/arrays/interval.py

+
+
+def _get_combined_data(
+    left: Union["Index", ArrayLike], right: Union["Index", ArrayLike]


can this method be more strict? e.g. only accept Union[np.ndarray, "DatetimeArray", "TimedeltaArray"]

e.g. things that have already been casted?

i think we can do quite a bit less back-and-forth casting eventually, yes

…f-ia-ii-3

jreback · 2020-10-12T15:19:13Z

cc @jschendel @TomAugspurger if you want to look

TomAugspurger · 2020-10-12T15:22:42Z

I think things looked fine last I checked.

…

On Mon, Oct 12, 2020 at 10:19 AM Jeff Reback ***@***.***> wrote: cc @jschendel <https://github.com/jschendel> @TomAugspurger <https://github.com/TomAugspurger> if you want to look — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#37047 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAKAOIXWO4HOX6ZZUHVTPELSKMNAHANCNFSM4SLO3GZQ> .

jreback · 2020-10-12T23:28:52Z

thanks @jbrockmendel

jorisvandenbossche · 2020-10-13T10:05:42Z

The consequence of this is that we can never use another ExtensionArray as the backer of an IntervalArray, right? (except for the DatetimeArray for which this is an internal special case already)
I am not sure we want to make this choice, even though right now we don't have any other EA-backed intervals?

Also, @jbrockmendel you mention this change is mainly for perf reasons. What is the amount of speed-up that this gives? Can you give a few examples? (are there benchmarks for this?)

jreback · 2020-10-13T10:39:32Z

The consequence of this is that we can never use another ExtensionArray as the backer of an IntervalArray, right? (except for the DatetimeArray for which this is an internal special case already)

I am not sure we want to make this choice, even though right now we don't have any other EA-backed intervals?

Also, @jbrockmendel you mention this change is mainly for perf reasons. What is the amount of speed-up that this gives? Can you give a few examples? (are there benchmarks for this?)

huh why make things more complicated ? what reason do you have for not dojng this?
we want simpler not more complicated - it's good to be opionates

jorisvandenbossche · 2020-10-13T11:08:02Z

I think the model we have had for years (storing a left and right array) is actually quite simple, and maps to the mental model of an IntervalArray as a struct array.
(in the end, storing 2 1D array or a single 2D arrays is really not that much of a difference conceptually)

(and as a sidenote @jreback, some feedback on how we are discussing things: words like "huh" (which give the impression to me as if what I said is complete nonsense) is not really helpfull for a good discussion)

jreback · 2020-10-13T11:14:22Z

i think huh is pretty telling actually - you are making a late comment because on a PR because
of what? you think that the original implementation is good? i don't understand anything from your original comment

so i don't this is helpful at all and the huh is confusion because of the commen

jorisvandenbossche · 2020-10-13T11:49:40Z

(then you can also say: "I am confused" instead of "huh?". Maybe for you it is the same, but so I am giving you the feedback that this phrasing is giving me a very negative feeling. Even if you are confused why I am commenting on a closed PR, I think you can say that in a different way)

Yes, this PR is already merged, but since when does that mean I cannot have comments anymore about the change? (the PR was also open for not even 2 days, without a prior issue discussing this. That's not leaving much time to comment on it while it was open)

But so about the actual topic:

what reason do you have for not doing this?

To repeat what I said in my original comment: this means "we can never use another ExtensionArray as the backer of an IntervalArray".

Of course, right now, we only use numpy arrays as the underlying values for IntervalArray (also because we don't yet have an EA-backed Index). So there is not a direct issue at this moment with this change.

But I personally think it is reasonable to consider allowing IntervalArray to be backed by both ndarray or EA.
Actually, if we would remove the coercing to object dtype for EAs / dtype validation in _simple_new, I think a large part of the IntervalArray implementation would actually right now already work with EAs (IntervalIndex with the tree engine is something else of course, but indexing engines with EAs is a general topic we still need to solve).

And why would we want to have IntervalArray backed by EAs?
First, because we have the extension mechanism of EAs, and supporting extension dtypes in IntervalArray as well could be seen as a way to further complete the EA support throughout pandas.
More specifically: what are we going to do with our own numeric EAs (integer dtype, floating dtype, etc)? Right now those are not yet integrated with IntervalArray, but at some point we will want to be able to create an IntervalArray from a nullable integer column, that preserves the nullable dtype (eg when accessing left/right, or mid). There are of course other ways to implement this, but I think as a default storing the EA as left/right array is the easiest solution.

TomAugspurger · 2020-10-13T11:59:56Z

Ah that's a fair point about other extension types I hadn't considered.

IIUC, the main reason for this PR was to clean up the (hacky) code to support IntervalArray.__setitem__. I thought that there was more discussion around performance, but maybe I'm thinking of another issue. I guess I was thinking of #36310.

jreback · 2020-10-13T12:09:47Z

I think this is a better representation actually of the primary usecases of IntervalArrays, until / unless we have another one this is one that keeps consistency of dtypes which has plagued since the beginning. This advocates for 2E EA (@jbrockmendel will be happy).

THe main point though is the mental model is much simpler. This is a collection of Interval points that are joined exactly; 2 independent arrays are not, you have lots of checking are they they same dtype, are they the same length and so on. this is simpler.

Let's not build more than we need now.

jorisvandenbossche · 2020-10-14T12:54:32Z

this is one that keeps consistency of dtypes which has plagued since the beginning.

Not fully sure what you mean here. Can you give an example of a past issue releted to this?

THe main point though is the mental model is much simpler. This is a collection of Interval points that are joined exactly; 2 independent arrays are not, you have lots of checking are they they same dtype, are they the same length and so on. this is simpler.

Well, the current code to deal with that was mostly focused in a single function, i.e. _simple_new. And all that code is still needed after this PR (only moved to a new helper function), because we still allow to create an IntervalArray from two arrays (which needs all this checking upon construction).

Whether a single 2D array is "much simpler" is clearly subjective then, but anyway, my main worry this related to EA support:

Let's not build more than we need now.

We actually need it now. For sure, there is not an open PR or issue about supporting EAs in IntervalArray right now. But we are still actively working on making the nullable dtypes full replacements for their numpy counterparts. Support in IntervalArray is for me part of that "full support".
So yes, I think we should think about how this can fit in with this refactor now, before doing any further work on this code.

jbrockmendel · 2020-10-14T15:07:22Z

you mention this change is mainly for perf reasons. What is the amount of speed-up that this gives? Can you give a few examples? (are there benchmarks for this?)

Not on hand, no.

But we are still actively working on making the nullable dtypes full replacements for their numpy counterparts

full replacements for numpy counterparts would support 2D.

jorisvandenbossche · 2020-10-14T15:29:00Z

full replacements for numpy counterparts would support 2D.

Not necessarily for pandas functionality, though

But that's not the actual discussion here. To repeat again: I think we should be able to have an IntervalArray using EA (eg to support IntervalArray with nullable int/float). How do we do that? (on the short term)

And yes, 2D EAs would be one way. But that's a much bigger discussion, though. So if someone wants to have that discussion, I would suggest to open a separate issue for that.
But I think storing left/right as 1D arrays (as we have always done up to now) is the much easier solution on the short term.

jorisvandenbossche · 2020-10-27T10:31:56Z

What the status of this discussion?
If I recall correctly, on the dev meeting two weeks ago it was said that this might enable performance benefits (and that @jbrockmendel you would try an example perf improvement PR?), and that otherwise there would be no objection to reverting it (is that correct?)

Since I haven't seen further attempts to use this for performance improvements, shall we revert it?

jbrockmendel · 2020-10-27T15:03:12Z

you would try an example perf improvement PR

thanks for the reminder; ill move this up the priority list

jorisvandenbossche · 2020-11-03T09:57:56Z

Any update here? Otherwise I would propose to revert this change.

jbrockmendel · 2020-11-03T14:54:37Z

I've got a branch going on the set ops, but it has devolved into yak shaving. There is code simplification available unrelated to perf, but on the call I agreed not to move forward with any of that in the short-term.

jorisvandenbossche · 2020-11-17T20:53:27Z

Another ping here. If there are no plans for actual improvements based on this change, are you OK with reverting it?

jbrockmendel · 2020-11-18T03:55:36Z

Do code simplifications count as improvements?

The stated reason for wanting to revert is to allow IntegerArray/FloatArray to back it. Are there plans to do that in the near future?

jorisvandenbossche · 2020-11-18T09:19:24Z

Do code simplifications count as improvements?

In principle certainly yes, but in this case I think it is quite subjective. As I said above, I personally don't find this much of a simplification. But it is clear that there is disagreement about this ;)

The stated reason for wanting to revert is to allow IntegerArray/FloatArray to back it. Are there plans to do that in the near future?

As mentioned above: there is no concrete open issue or PR for this, but we are actively working towards full support for the nullable dtypes so they can become the default in a future release. While working on this the coming months, supporting them in IntervalArray will be one of the issues we need to tackle. So yes, there are IMO plans to do this in the near future.

jreback · 2020-11-18T11:53:13Z

was wavering in this but i agree with @jorisvandenbossche logic here for reverting

jbrockmendel · 2020-11-18T16:12:44Z

OK. easy to un-revert once either a) i finish the yak shaving or b) nullable dtypes support 2D.

This reverts commit 9cb3723.

jbrockmendel added 6 commits October 6, 2020 11:54

REF: back IntervalArray by a _single_ ndarray

34b44cb

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

d4074c7

…f-ia-ii-3

Remove _complicated_new

014c13e

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

8a4f7d4

…f-ia-ii-3

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

b93fbc0

…f-ia-ii-3

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

e9e9b7c

…f-ia-ii-3

jreback requested changes Oct 11, 2020

View reviewed changes

jreback added Interval Interval data type Refactor Internal refactoring of code labels Oct 11, 2020

jbrockmendel added 2 commits October 11, 2020 10:38

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

763a384

…f-ia-ii-3

annotate

a9adb76

jreback reviewed Oct 11, 2020

View reviewed changes

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

66c70e8

…f-ia-ii-3

jreback added this to the 1.2 milestone Oct 12, 2020

jreback approved these changes Oct 12, 2020

View reviewed changes

jreback merged commit 9cb3723 into pandas-dev:master Oct 12, 2020

jbrockmendel deleted the ref-ia-ii-3 branch October 13, 2020 00:44

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Oct 26, 2020

REF: back IntervalArray by a single ndarray (pandas-dev#37047)

4bebcb9

jorisvandenbossche mentioned this pull request Oct 27, 2020

REF: IntervalArray comparisons #37124

Merged

5 tasks

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

REF: back IntervalArray by a single ndarray (pandas-dev#37047)

ce15244

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Nov 23, 2020

Revert "REF: back IntervalArray by a single ndarray (pandas-dev#37047)"

8b6a242

This reverts commit 9cb3723.

jorisvandenbossche mentioned this pull request Nov 23, 2020

Revert "REF: back IntervalArray by a single ndarray (#37047)" #38024

Merged

jreback pushed a commit that referenced this pull request Nov 24, 2020

Revert "REF: back IntervalArray by a single ndarray (#37047)" (#38024)

b19e47f

This reverts commit 9cb3723.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

REF: back IntervalArray by a single ndarray #37047

REF: back IntervalArray by a single ndarray #37047

jbrockmendel commented Oct 11, 2020

jreback Oct 11, 2020

jbrockmendel Oct 11, 2020

jreback Oct 11, 2020

jreback Oct 11, 2020

jbrockmendel Oct 11, 2020

jreback commented Oct 12, 2020

TomAugspurger commented Oct 12, 2020 via email

jreback commented Oct 12, 2020

jorisvandenbossche commented Oct 13, 2020

jreback commented Oct 13, 2020

jorisvandenbossche commented Oct 13, 2020

jreback commented Oct 13, 2020

jorisvandenbossche commented Oct 13, 2020

TomAugspurger commented Oct 13, 2020

jreback commented Oct 13, 2020

jorisvandenbossche commented Oct 14, 2020

jbrockmendel commented Oct 14, 2020

jorisvandenbossche commented Oct 14, 2020

jorisvandenbossche commented Oct 27, 2020

jbrockmendel commented Oct 27, 2020

jorisvandenbossche commented Nov 3, 2020

jbrockmendel commented Nov 3, 2020

jorisvandenbossche commented Nov 17, 2020

jbrockmendel commented Nov 18, 2020

jorisvandenbossche commented Nov 18, 2020

jreback commented Nov 18, 2020

jbrockmendel commented Nov 18, 2020



		def _get_combined_data(
		left: Union["Index", ArrayLike], right: Union["Index", ArrayLike]

REF: back IntervalArray by a single ndarray #37047

REF: back IntervalArray by a single ndarray #37047

Conversation

jbrockmendel commented Oct 11, 2020

jreback Oct 11, 2020

Choose a reason for hiding this comment

jbrockmendel Oct 11, 2020

Choose a reason for hiding this comment

jreback Oct 11, 2020

Choose a reason for hiding this comment

jreback Oct 11, 2020

Choose a reason for hiding this comment

jbrockmendel Oct 11, 2020

Choose a reason for hiding this comment

jreback commented Oct 12, 2020

TomAugspurger commented Oct 12, 2020 via email

jreback commented Oct 12, 2020

jorisvandenbossche commented Oct 13, 2020

jreback commented Oct 13, 2020

jorisvandenbossche commented Oct 13, 2020

jreback commented Oct 13, 2020

jorisvandenbossche commented Oct 13, 2020

TomAugspurger commented Oct 13, 2020

jreback commented Oct 13, 2020

jorisvandenbossche commented Oct 14, 2020

jbrockmendel commented Oct 14, 2020

jorisvandenbossche commented Oct 14, 2020

jorisvandenbossche commented Oct 27, 2020

jbrockmendel commented Oct 27, 2020

jorisvandenbossche commented Nov 3, 2020

jbrockmendel commented Nov 3, 2020

jorisvandenbossche commented Nov 17, 2020

jbrockmendel commented Nov 18, 2020

jorisvandenbossche commented Nov 18, 2020

jreback commented Nov 18, 2020

jbrockmendel commented Nov 18, 2020