Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix duplicates in intersection of multiindexes #36927

Merged
merged 27 commits into from
Nov 29, 2020

Conversation

phofl
Copy link
Member

@phofl phofl commented Oct 6, 2020

Seems like this was not introduced on purpose. Probably introduced in #31312

@phofl phofl added MultiIndex Regression Functionality that used to work in a prior pandas version labels Oct 6, 2020
Copy link
Member

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirming that #31312 caused the regression

pandas/core/indexes/multi.py Outdated Show resolved Hide resolved
@arw2019
Copy link
Member

arw2019 commented Oct 9, 2020

Looking at CI possibly related so actually not sure this is good as is

� Conflicts:
�	doc/source/whatsnew/v1.1.4.rst
�	pandas/core/indexes/multi.py
@phofl
Copy link
Member Author

phofl commented Oct 9, 2020

You are right, this definitely is related. The problem is in

https://github.com/pandas-dev/pandas/blob/653f6944eba664d19e8d93e850340ac039ec452e/pandas/core/ops/__init__.py#L482:L486

I think this should be left.columns.unique() and right.columns.unique() if the intersection should be unique?

cc @jbrockmendel

@phofl
Copy link
Member Author

phofl commented Oct 11, 2020

With the change introduced by f4dc9f9 we have to handle both issues at once to avoid bugs at other places

if len(cols) and not (cols.equals(left.columns) and cols.equals(right.columns)):
if len(cols) and not (
cols.equals(left.columns.unique()) and cols.equals(right.columns.unique())
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a test that fails without this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the builds are gone. I will run the tests in the evening to find the relevant tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def test_column_dups_operations(self):

This one is crashing in line 255 because of memory issues. The condition is never fullfilled, so it blows up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the condition in master is never fulfilled or the condition in the PR?

btw, usually let the person who asked the question hit the "resolve conversation" button

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, did not know that. Will do that in the future.

The input df has columns [A,A]. When intersection is unique, then cols=[A], while left.columns=[A,A] and right.columns=[A,A]. Without the change adding the unique, cols.equals(left.columns) and cols.equals(right.columns) will always be False. So we always return True, which results in duplicating the columns for every passthrough, hence the memory explosion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still needed?

we now guaranteee that intersection returns unique columns, right so this should no longer be the case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is the problem. Is intersection is unique, cols.equals(left.columns) won't be True. This leads to a recursion in the test mentioned which blows up the memory, because we can not exit this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok pls add some comments to this effect then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would prob calculate left_uniques and right_uniques and make them variables as a bit simpler to read

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, changed it and added a comment

pandas/core/indexes/base.py Outdated Show resolved Hide resolved
@phofl
Copy link
Member Author

phofl commented Oct 26, 2020

Had to change a check in merge, which relied on the wrong behavior of intersection.

@jbrockmendel
Copy link
Member

After taking another look, im thinking we might want to just disallow set ops when there are duplicates

� Conflicts:
�	doc/source/whatsnew/v1.2.0.rst
�	pandas/tests/indexes/test_setops.py
@phofl
Copy link
Member Author

phofl commented Nov 12, 2020

@jbrockmendel You mean always returning False when left or right contains duplicates? In this case i have changed it accordingly

@jbrockmendel
Copy link
Member

You mean always returning False when left or right contains duplicates? In this case i have changed it accordingly

I mean that set operations only make sense when dealing with something set-like, i.e. unique. So we could just raise ValueError("Set operations are not well-defined on non-unique Indexes"). Not sure if thats desirable, just thinking out loud.

@phofl
Copy link
Member Author

phofl commented Nov 12, 2020

If we do this, we should probably raise DeprecationWarning first to avoid breaking code?

@@ -774,6 +774,7 @@ Other
- Passing an array with 2 or more dimensions to the :class:`Series` constructor now raises the more specific ``ValueError`` rather than a bare ``Exception`` (:issue:`35744`)
- Bug in ``dir`` where ``dir(obj)`` wouldn't show attributes defined on the instance for pandas objects (:issue:`37173`)
- Bug in :meth:`RangeIndex.difference` returning :class:`Int64Index` in some cases where it should return :class:`RangeIndex` (:issue:`38028`)
- Bug in :meth:`Index.intersection` returning duplicates when at least one of the indexes had duplicates (:issue:`31326`)
Copy link
Member

@simonjayhawkins simonjayhawkins Nov 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to remove this, if we are backporting there should be nothing added to doc/source/whatsnew/v1.2.0.rst (doesn't exist on 1.1.x branch)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, probably messed up the merge. Is gone now

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Nov 28, 2020
@jreback
Copy link
Contributor

jreback commented Nov 28, 2020

this might be a regression but going to be work to backport. @phofl see if you can get all passing on master.

@phofl
Copy link
Member Author

phofl commented Nov 28, 2020

Failing tests relied on the wrong behavior, adjusted them

if len(cols) and not (cols.equals(left.columns) and cols.equals(right.columns)):
if len(cols) and not (
cols.equals(left.columns.unique()) and cols.equals(right.columns.unique())
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still needed?

we now guaranteee that intersection returns unique columns, right so this should no longer be the case.

@@ -2858,7 +2859,7 @@ def _intersection(self, other, sort=False):
indexer = algos.unique1d(Index(rvals).get_indexer_non_unique(lvals)[0])
indexer = indexer[indexer != -1]

result = other.take(indexer)._values
result = other.take(indexer).unique()._values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an assert that L2867 is always unique (e.g. we are guaranteed uniqu here L2862 (and I think safe_sort guaranteess this), but let's be explicit (and add a comment).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should check if this is_unique first. it will be computed but is likely faster than always doing this (and copies yet again).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may have a numpy array here, so we can not do result.is_unique. Is there a better way than using algos.unique on result and comparing shapes then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure you can do (measure perf)

result = Index(other.take(indexer), copy=False)
if not result.is_unique:
 .....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, did not get that. Did it now, is in #38154

@simonjayhawkins
Copy link
Member

@phofl @jreback all comments now addressed?

@phofl
Copy link
Member Author

phofl commented Nov 29, 2020

One open point is what we should do with the unique check, because is_unique may not exist. This is only relevant to performance

@jreback
Copy link
Contributor

jreback commented Nov 29, 2020

lgtm. small point about perf above (but i don't think it matters to backport, can just do on master).

@jreback jreback merged commit e99e5ab into pandas-dev:master Nov 29, 2020
@jreback
Copy link
Contributor

jreback commented Nov 29, 2020

thanks @phofl

@phofl phofl deleted the 36915 branch November 29, 2020 17:23
@simonjayhawkins
Copy link
Member

@meeseeksdev backport 1.1.x

@lumberbot-app

This comment has been minimized.

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Nov 29, 2020
simonjayhawkins added a commit that referenced this pull request Nov 30, 2020
…es (#38155)

Co-authored-by: patrick <61934744+phofl@users.noreply.github.com>
@@ -3601,6 +3601,8 @@ def intersection(self, other, sort=False):
other, result_names = self._convert_can_do_setop(other)

if self.equals(other):
if self.has_duplicates:
return self.unique().rename(result_names)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phofl it looks like we do something slightly different in many of the FooIndex.intersection methods in the self.equals(other) case:

# Index (and IntervalIndex pending #38190)
        if self.equals(other) and not self.has_duplicates:
            return self._get_reconciled_name_object(other)

# datetimelike
        if self.equals(other):
            return self._get_reconciled_name_object(other)

# MultiIndex
        if self.equals(other):
            if self.has_duplicates:
                return self.unique().rename(result_names)
            return self._get_reconciled_name_object(other)

# PeriodIndex
        if self.equals(other):
            return self._get_reconciled_name_object(other)

# RangeIndex
        if self.equals(other):
            return self._get_reconciled_name_object(other)

The RangeIndex one is equivalent to the Index/Intervalindex bc it never has duplicates (can add that check to make the code match exactly). Can the others be made to have identical logic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there still an open case where I can be of help?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're all set for now, will take another look after all the current Index.intersection PRs go through. thanks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like several of them are still either

        if self.equals(other):
            return self._get_reconciled_name_object(other)

which i think is wrong if it has duplicates, or

        if self.equals(other) and not self.has_duplicates:
            return self._get_reconciled_name_object(other)

which isnt handling the has-duplicates case like the others. can these all be identical?

Also if you really get on a roll, DatetimeTimedeltaMixin._intersection has cases for len(self) == 0 and len(other)==0 that would be nice to standardize+test. RangeIndex._intersection has a similar check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will look through them and try to standardize them as much as possible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MultiIndex Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Intersection of multiindex returns duplicates
5 participants