BUG: Fix duplicates in intersection of multiindexes #36927

phofl · 2020-10-06T20:08:06Z

closes BUG: Intersection of multiindex returns duplicates #36915
xref BUG: inconsistent behaviors for Index.union() and Index.intersection() with duplicates #31326 (closes the intersection part)
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Seems like this was not introduced on purpose. Probably introduced in #31312

arw2019

Confirming that #31312 caused the regression

pandas/core/indexes/multi.py

arw2019 · 2020-10-09T04:15:31Z

Looking at CI, the failures are possibly related, so I'm actually not sure this is good as is

phofl · 2020-10-09T06:45:26Z

You are right, this definitely is related. The problem is in

https://github.com/pandas-dev/pandas/blob/653f6944eba664d19e8d93e850340ac039ec452e/pandas/core/ops/__init__.py#L482:L486

I think this should be left.columns.unique() and right.columns.unique() if the intersection should be unique?

cc @jbrockmendel

phofl · 2020-10-11T21:03:20Z

With the change introduced by f4dc9f9, we have to handle both issues at once to avoid bugs in other places.

pandas/core/indexes/multi.py

jbrockmendel · 2020-10-26T01:21:59Z

pandas/core/ops/__init__.py

-        if len(cols) and not (cols.equals(left.columns) and cols.equals(right.columns)):
+        if len(cols) and not (
+            cols.equals(left.columns.unique()) and cols.equals(right.columns.unique())
+        ):


do we have a test that fails without this change?

Yes, but the builds are gone. I will run the tests in the evening to find the relevant tests.

pandas/pandas/tests/frame/test_nonunique_indexes.py

Line 10 in 8985801

def test_column_dups_operations(self):

This one is crashing in line 255 because of memory issues. The condition is never fulfilled, so it blows up

the condition in master is never fulfilled or the condition in the PR?

btw, usually let the person who asked the question hit the "resolve conversation" button

Sorry, did not know that. Will do that in the future.

The input df has columns [A,A]. When intersection is unique, then cols=[A], while left.columns=[A,A] and right.columns=[A,A]. Without the change adding the unique, cols.equals(left.columns) and cols.equals(right.columns) will always be False. So we always return True, which results in duplicating the columns for every passthrough, hence the memory explosion.

is this still needed?

we now guarantee that intersection returns unique columns, right? so this should no longer be the case.

Yeah, this is the problem. If the intersection is unique, cols.equals(left.columns) won't be True. This leads to a recursion in the test mentioned, which blows up the memory because we cannot exit it.

ok pls add some comments to this effect then.

i would prob calculate left_uniques and right_uniques and make them variables as a bit simpler to read

Thx, changed it and added a comment
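The condition discussed in this thread can be modeled with plain Python lists standing in for the column indexes. This is only a sketch: the helpers `unique`, `intersection`, `needs_reindex_old` and `needs_reindex_new` are hypothetical names, not pandas APIs, and they assume that intersection always returns unique labels.

```python
def unique(labels):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(labels))


def intersection(left, right):
    """Model of a set-like intersection: the result never contains duplicates."""
    right_set = set(right)
    return [label for label in unique(left) if label in right_set]


def needs_reindex_old(cols, left_cols, right_cols):
    # Old condition: compares the (unique) intersection with the raw columns.
    return bool(len(cols)) and not (cols == left_cols and cols == right_cols)


def needs_reindex_new(cols, left_cols, right_cols):
    # New condition: compares against the de-duplicated columns instead.
    left_uniques = unique(left_cols)
    right_uniques = unique(right_cols)
    return bool(len(cols)) and not (cols == left_uniques and cols == right_uniques)


left_cols = ["A", "A"]
right_cols = ["A", "A"]
cols = intersection(left_cols, right_cols)

print(needs_reindex_old(cols, left_cols, right_cols))  # stays True on every pass
print(needs_reindex_new(cols, left_cols, right_cols))  # False, so the caller can exit
```

With columns [A, A] on both sides, the old check can never become False, which matches the runaway behavior described for test_column_dups_operations.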

pandas/core/indexes/base.py

phofl · 2020-10-26T19:32:24Z

Had to change a check in merge, which relied on the wrong behavior of intersection.

jbrockmendel · 2020-11-12T03:30:01Z

After taking another look, im thinking we might want to just disallow set ops when there are duplicates

phofl · 2020-11-12T11:53:23Z

@jbrockmendel You mean always returning False when left or right contains duplicates? In this case i have changed it accordingly

jbrockmendel · 2020-11-12T16:20:26Z

You mean always returning False when left or right contains duplicates? In this case i have changed it accordingly

I mean that set operations only make sense when dealing with something set-like, i.e. unique. So we could just raise ValueError("Set operations are not well-defined on non-unique Indexes"). Not sure if thats desirable, just thinking out loud.

phofl · 2020-11-12T21:52:10Z

If we do this, we should probably raise DeprecationWarning first to avoid breaking code?
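A rough sketch of the warn-first idea, again using plain lists rather than pandas objects. The function name `intersection_with_deprecation` is hypothetical, and the transition plan in the comment (warn now, raise later) is an assumption about how such a change could be staged, not something decided in this thread.

```python
import warnings


def intersection_with_deprecation(left, right):
    """Set-like intersection that warns when either input has duplicates."""
    if len(set(left)) != len(left) or len(set(right)) != len(right):
        # Hypothetical transition step: warn now, raise ValueError in a later release.
        warnings.warn(
            "Set operations are not well-defined on non-unique Indexes",
            DeprecationWarning,
            stacklevel=2,
        )
    right_set = set(right)
    # dict.fromkeys de-duplicates while preserving first-seen order.
    return [label for label in dict.fromkeys(left) if label in right_set]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = intersection_with_deprecation(["a", "a", "b"], ["a", "c"])

print(result)
print([str(w.message) for w in caught])
```

Existing callers keep getting a unique result during the deprecation period, and only the warning signals that non-unique inputs may be rejected later.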

simonjayhawkins · 2020-11-28T15:12:54Z

doc/source/whatsnew/v1.2.0.rst

@@ -774,6 +774,7 @@ Other
 - Passing an array with 2 or more dimensions to the :class:`Series` constructor now raises the more specific ``ValueError`` rather than a bare ``Exception`` (:issue:`35744`)
 - Bug in ``dir`` where ``dir(obj)`` wouldn't show attributes defined on the instance for pandas objects (:issue:`37173`)
 - Bug in :meth:`RangeIndex.difference` returning :class:`Int64Index` in some cases where it should return :class:`RangeIndex` (:issue:`38028`)
+- Bug in :meth:`Index.intersection` returning duplicates when at least one of the indexes had duplicates (:issue:`31326`)


need to remove this, if we are backporting there should be nothing added to doc/source/whatsnew/v1.2.0.rst (doesn't exist on 1.1.x branch)

Sorry, probably messed up the merge. Is gone now

jreback · 2020-11-28T17:12:57Z

this might be a regression but going to be work to backport. @phofl see if you can get all passing on master.

phofl · 2020-11-28T17:54:29Z

Failing tests relied on the wrong behavior, adjusted them

jreback · 2020-11-28T18:01:05Z

pandas/core/ops/__init__.py

-        if len(cols) and not (cols.equals(left.columns) and cols.equals(right.columns)):
+        if len(cols) and not (
+            cols.equals(left.columns.unique()) and cols.equals(right.columns.unique())
+        ):


is this still needed?

we now guarantee that intersection returns unique columns, right? so this should no longer be the case.

jreback · 2020-11-28T18:02:08Z

pandas/core/indexes/base.py

@@ -2858,7 +2859,7 @@ def _intersection(self, other, sort=False):
            indexer = algos.unique1d(Index(rvals).get_indexer_non_unique(lvals)[0])
            indexer = indexer[indexer != -1]

-        result = other.take(indexer)._values
+        result = other.take(indexer).unique()._values


can you add an assert that L2867 is always unique? (e.g. we are guaranteed unique here at L2862, and I think safe_sort guarantees this), but let's be explicit (and add a comment).

you should check if this is_unique first. it will be computed but is likely faster than always doing this (and copies yet again).

We may have a numpy array here, so we can not do result.is_unique. Is there a better way than using algos.unique on result and comparing shapes then?

sure you can do (measure perf)

result = Index(other.take(indexer), copy=False)
if not result.is_unique:
    .....

Sorry, did not get that. Did it now, is in #38154
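The performance point in this thread is to check for uniqueness first and only de-duplicate (and copy) when duplicates are actually present. A minimal pure-Python sketch of that pattern follows; `is_unique` and `take_then_dedupe` are hypothetical helpers, and comparing lengths stands in for the shape comparison mentioned above.

```python
def is_unique(values):
    """True when no value occurs more than once (length comparison)."""
    return len(set(values)) == len(values)


def take_then_dedupe(values, indexer):
    """Take positions from `values`, de-duplicating only when needed."""
    taken = [values[i] for i in indexer]
    if is_unique(taken):
        # Fast path: nothing to remove, so skip the extra copy.
        return taken
    # Slow path: drop repeats while keeping first-seen order.
    return list(dict.fromkeys(taken))


print(take_then_dedupe(["a", "b", "c", "a"], [0, 3, 1]))
print(take_then_dedupe(["a", "b", "c"], [2, 0]))
```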

simonjayhawkins · 2020-11-29T11:16:06Z

@phofl @jreback all comments now addressed?

phofl · 2020-11-29T11:46:00Z

One open point is what we should do with the unique check, because is_unique may not exist. This is only relevant to performance

jreback · 2020-11-29T17:21:46Z

lgtm. small point about perf above (but i don't think it matters to backport, can just do on master).

jreback · 2020-11-29T17:21:56Z

thanks @phofl

simonjayhawkins · 2020-11-29T18:21:05Z

@meeseeksdev backport 1.1.x

jbrockmendel · 2020-11-30T18:04:10Z

pandas/core/indexes/multi.py

@@ -3601,6 +3601,8 @@ def intersection(self, other, sort=False):
        other, result_names = self._convert_can_do_setop(other)

        if self.equals(other):
+            if self.has_duplicates:
+                return self.unique().rename(result_names)


@phofl it looks like we do something slightly different in many of the FooIndex.intersection methods in the self.equals(other) case:

# Index (and IntervalIndex pending #38190)
if self.equals(other) and not self.has_duplicates:
    return self._get_reconciled_name_object(other)

# datetimelike
if self.equals(other):
    return self._get_reconciled_name_object(other)

# MultiIndex
if self.equals(other):
    if self.has_duplicates:
        return self.unique().rename(result_names)
    return self._get_reconciled_name_object(other)

# PeriodIndex
if self.equals(other):
    return self._get_reconciled_name_object(other)

# RangeIndex
if self.equals(other):
    return self._get_reconciled_name_object(other)

The RangeIndex one is equivalent to the Index/IntervalIndex one bc it never has duplicates (can add that check to make the code match exactly). Can the others be made to have identical logic?

Is there still an open case where I can be of help?

I think we're all set for now, will take another look after all the current Index.intersection PRs go through. thanks

It looks like several of them are still either

if self.equals(other):
    return self._get_reconciled_name_object(other)

which i think is wrong if it has duplicates, or

if self.equals(other) and not self.has_duplicates:
    return self._get_reconciled_name_object(other)

which isnt handling the has-duplicates case like the others. can these all be identical?

Also if you really get on a roll, DatetimeTimedeltaMixin._intersection has cases for len(self) == 0 and len(other)==0 that would be nice to standardize+test. RangeIndex._intersection has a similar check.

Will look through them and try to standardize them as much as possible
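One way to picture the identical logic asked for in this thread is a single fast path that every index type could share. The sketch below models an index as a plain list (a MultiIndex as a list of tuples); `intersection_fast_path` is a hypothetical helper, not an existing pandas method, and name reconciliation is left out.

```python
def intersection_fast_path(self_values, other_values):
    """Return the fast-path result when both inputs are equal, else None."""
    if self_values != other_values:
        # Not equal: the caller falls through to the general algorithm.
        return None
    if len(set(self_values)) != len(self_values):
        # has_duplicates: the result must still be set-like.
        return list(dict.fromkeys(self_values))
    return list(self_values)


mi = [("a", 1), ("a", 1), ("b", 2)]
print(intersection_fast_path(mi, mi))
```

Handling the has-duplicates case inside the equals branch mirrors the MultiIndex change in this PR, while the no-duplicates branch corresponds to returning the object unchanged.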

Fix duplicates in intersectin of multiindexes

cdefaae

phofl added the MultiIndex and Regression (functionality that used to work in a prior pandas version) labels Oct 6, 2020

phofl added 3 commits October 6, 2020 22:33

Fix duplicates in index intersection

fbd63f2

Modify test and avoid None issues

53a37d1

Fix failing test

5675a4e

arw2019 approved these changes Oct 9, 2020

pandas/core/indexes/multi.py (outdated)

Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915

134936c

# Conflicts: # doc/source/whatsnew/v1.1.4.rst # pandas/core/indexes/multi.py

phofl added 6 commits October 9, 2020 09:16

Change comment

582c0b9

Add unique after intersection

7805de5

Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915

67691df

# Conflicts: # doc/source/whatsnew/v1.1.4.rst

Merge branch '31326' into 36915

8fb0055

# Conflicts: # pandas/core/indexes/base.py

Fix merge bug

66b519f

Add tests and whatsnew

cb1477b

jbrockmendel reviewed Oct 26, 2020

pandas/core/indexes/multi.py (outdated)

jbrockmendel reviewed Oct 26, 2020

jreback requested changes Oct 26, 2020

pandas/core/indexes/base.py (outdated)

phofl added 3 commits October 26, 2020 15:45

Add rename

0fb2561

Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915

3c19d57

# Conflicts: # doc/source/whatsnew/v1.2.0.rst # pandas/core/indexes/base.py

Fix check in merge operation

10524fd

phofl added 2 commits November 12, 2020 12:51

Exit set ops when nonunique

3dde0ee

Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915

a0a1a33

# Conflicts: # doc/source/whatsnew/v1.2.0.rst # pandas/tests/indexes/test_setops.py

phofl added 2 commits November 27, 2020 19:30

Change gh reference

742716e

Remove pd

321797a

simonjayhawkins reviewed Nov 28, 2020

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Nov 28, 2020

test backportability of pandas-dev#36927

b0913c0

phofl added 2 commits November 28, 2020 18:51

Remove whatsnew from 1.2

a980ec0

Fix test

972fd48

jreback requested changes Nov 28, 2020

phofl added 2 commits November 28, 2020 19:27

Make condition more clear and add assert

fe1ded4

Use shape for equality check

8e4d47b

jreback approved these changes Nov 29, 2020

jreback merged commit e99e5ab into pandas-dev:master Nov 29, 2020

phofl deleted the 36915 branch November 29, 2020 17:23

lumberbot-app bot added the Still Needs Manual Backport label Nov 29, 2020

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Nov 29, 2020

Backport PR pandas-dev#36927: BUG: Fix duplicates in intersection of …

c6494a4

…multiindexes

simonjayhawkins mentioned this pull request Nov 29, 2020

Backport PR #36927: BUG: Fix duplicates in intersection of multiindexes #38155

Merged

simonjayhawkins removed the Still Needs Manual Backport label Nov 29, 2020

simonjayhawkins added a commit that referenced this pull request Nov 30, 2020

Backport PR #36927: BUG: Fix duplicates in intersection of multiindex…

8a2b8e2

…es (#38155) Co-authored-by: patrick <61934744+phofl@users.noreply.github.com>

jbrockmendel reviewed Nov 30, 2020

jorisvandenbossche mentioned this pull request Dec 4, 2020

BUG: Index.intersection casting to object instead of numeric #38122

Merged

5 tasks

simonjayhawkins mentioned this pull request Dec 10, 2020

BUG: Pandas 1.1.5 location-based indexing error with quantized pivot table #38367

Closed

3 tasks

simonjayhawkins mentioned this pull request Feb 12, 2021

BUG: DataFrame.to_excel() now raises if column parameter contains duplicates #39695

Closed

3 tasks

simonjayhawkins mentioned this pull request Dec 17, 2021

BUG: misleading error message when aggregating duplicate column names in groupby #44924

Closed

3 tasks
