Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix duplicates in intersection of multiindexes #36927

Merged
merged 27 commits into from
Nov 29, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
cdefaae
Fix duplicates in intersectin of multiindexes
phofl Oct 6, 2020
fbd63f2
Fix duplicates in index intersection
phofl Oct 6, 2020
53a37d1
Modify test and avoid None issues
phofl Oct 6, 2020
5675a4e
Fix failing test
phofl Oct 6, 2020
134936c
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915
phofl Oct 9, 2020
582c0b9
Change comment
phofl Oct 9, 2020
7805de5
Add unique after intersection
phofl Oct 11, 2020
67691df
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915
phofl Oct 11, 2020
8fb0055
Merge branch '31326' into 36915
phofl Oct 11, 2020
66b519f
Fix merge bug
phofl Oct 11, 2020
cb1477b
Add tests and whatsnew
phofl Oct 11, 2020
0fb2561
Add rename
phofl Oct 26, 2020
3c19d57
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915
phofl Oct 26, 2020
10524fd
Fix check in merge operation
phofl Oct 26, 2020
3dde0ee
Exit set ops when nonunique
phofl Nov 12, 2020
a0a1a33
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915
phofl Nov 12, 2020
45dfb84
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915
phofl Nov 22, 2020
d71a499
Roll back to initial version
phofl Nov 22, 2020
d873d5a
Change whatsnew
phofl Nov 22, 2020
e90239a
Move whatsnew
phofl Nov 27, 2020
c2b448a
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915
phofl Nov 27, 2020
742716e
Change gh reference
phofl Nov 27, 2020
321797a
Remove pd
phofl Nov 27, 2020
a980ec0
Remove whatsnew from 1.2
phofl Nov 28, 2020
972fd48
Fix test
phofl Nov 28, 2020
fe1ded4
Make condition more clear and add assert
phofl Nov 28, 2020
8e4d47b
Use shape for equality check
phofl Nov 28, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.5.rst
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ Fixed regressions
- Fixed regression in indexing on a :class:`Series` with ``CategoricalDtype`` after unpickling (:issue:`37631`)
- Fixed regression in ``df.groupby(..).rolling(..)`` with the resulting :class:`MultiIndex` when grouping by a label that is in the index (:issue:`37641`)
- Fixed regression in :meth:`DataFrame.fillna` not filling ``NaN`` after other operations such as :meth:`DataFrame.pivot` (:issue:`36495`).
- Fixed regression in :meth:`MultiIndex.intersection` returning duplicates when at least one of the indexes had duplicates (:issue:`36915`)

.. ---------------------------------------------------------------------------

Expand Down
9 changes: 6 additions & 3 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -2822,7 +2822,7 @@ def intersection(self, other, sort=False):
self._assert_can_do_setop(other)
other = ensure_index(other)

if self.equals(other):
if self.equals(other) and not self.has_duplicates:
return self._get_reconciled_name_object(other)

if not is_dtype_equal(self.dtype, other.dtype):
Expand All @@ -2847,7 +2847,7 @@ def _intersection(self, other, sort=False):
except TypeError:
pass
else:
return result
return algos.unique1d(result)

try:
indexer = Index(rvals).get_indexer(lvals)
Expand All @@ -2858,11 +2858,14 @@ def _intersection(self, other, sort=False):
indexer = algos.unique1d(Index(rvals).get_indexer_non_unique(lvals)[0])
indexer = indexer[indexer != -1]

result = other.take(indexer)._values
result = other.take(indexer).unique()._values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an assert that L2867 is always unique (e.g. we are guaranteed uniqu here L2862 (and I think safe_sort guaranteess this), but let's be explicit (and add a comment).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should check if this is_unique first. it will be computed but is likely faster than always doing this (and copies yet again).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may have a numpy array here, so we can not do result.is_unique. Is there a better way than using algos.unique on result and comparing shapes then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure you can do (measure perf)

result = Index(other.take(indexer), copy=False)
if not result.is_unique:
 .....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, did not get that. Did it now, is in #38154


if sort is None:
result = algos.safe_sort(result)

# Intersection has to be unique
assert algos.unique(result).shape == result.shape

return result

def difference(self, other, sort=None):
Expand Down
8 changes: 6 additions & 2 deletions pandas/core/indexes/multi.py
Original file line number Diff line number Diff line change
Expand Up @@ -3601,6 +3601,8 @@ def intersection(self, other, sort=False):
other, result_names = self._convert_can_do_setop(other)

if self.equals(other):
if self.has_duplicates:
return self.unique().rename(result_names)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phofl it looks like we do something slightly different in many of the FooIndex.intersection methods in the self.equals(other) case:

# Index (and IntervalIndex pending #38190)
        if self.equals(other) and not self.has_duplicates:
            return self._get_reconciled_name_object(other)

# datetimelike
        if self.equals(other):
            return self._get_reconciled_name_object(other)

# MultiIndex
        if self.equals(other):
            if self.has_duplicates:
                return self.unique().rename(result_names)
            return self._get_reconciled_name_object(other)

# PeriodIndex
        if self.equals(other):
            return self._get_reconciled_name_object(other)

# RangeIndex
        if self.equals(other):
            return self._get_reconciled_name_object(other)

The RangeIndex one is equivalent to the Index/Intervalindex bc it never has duplicates (can add that check to make the code match exactly). Can the others be made to have identical logic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there still an open case where I can be of help?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're all set for now, will take another look after all the current Index.intersection PRs go through. thanks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like several of them are still either

        if self.equals(other):
            return self._get_reconciled_name_object(other)

which i think is wrong if it has duplicates, or

        if self.equals(other) and not self.has_duplicates:
            return self._get_reconciled_name_object(other)

which isnt handling the has-duplicates case like the others. can these all be identical?

Also if you really get on a roll, DatetimeTimedeltaMixin._intersection has cases for len(self) == 0 and len(other)==0 that would be nice to standardize+test. RangeIndex._intersection has a similar check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will look through them and try to standardize them as much as possible

return self.rename(result_names)

if not is_object_dtype(other.dtype):
Expand All @@ -3619,10 +3621,12 @@ def intersection(self, other, sort=False):
uniq_tuples = None # flag whether _inner_indexer was successful
if self.is_monotonic and other.is_monotonic:
try:
uniq_tuples = self._inner_indexer(lvals, rvals)[0]
sort = False # uniq_tuples is already sorted
inner_tuples = self._inner_indexer(lvals, rvals)[0]
sort = False # inner_tuples is already sorted
except TypeError:
pass
else:
uniq_tuples = algos.unique(inner_tuples)

if uniq_tuples is None:
other_uniq = set(rvals)
Expand Down
5 changes: 4 additions & 1 deletion pandas/core/ops/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -311,7 +311,10 @@ def should_reindex_frame_op(
# TODO: any other cases we should handle here?
cols = left.columns.intersection(right.columns)

if len(cols) and not (cols.equals(left.columns) and cols.equals(right.columns)):
# Intersection is always unique so we have to check the unique columns
left_uniques = left.columns.unique()
right_uniques = right.columns.unique()
if len(cols) and not (cols.equals(left_uniques) and cols.equals(right_uniques)):
# TODO: is there a shortcut available when len(cols) == 0?
return True

Expand Down
9 changes: 7 additions & 2 deletions pandas/core/reshape/merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -1271,7 +1271,9 @@ def _validate_specification(self):
raise MergeError("Must pass left_on or left_index=True")
else:
# use the common columns
common_cols = self.left.columns.intersection(self.right.columns)
left_cols = self.left.columns
right_cols = self.right.columns
common_cols = left_cols.intersection(right_cols)
if len(common_cols) == 0:
raise MergeError(
"No common columns to perform merge on. "
Expand All @@ -1280,7 +1282,10 @@ def _validate_specification(self):
f"left_index={self.left_index}, "
f"right_index={self.right_index}"
)
if not common_cols.is_unique:
if (
not left_cols.join(common_cols, how="inner").is_unique
or not right_cols.join(common_cols, how="inner").is_unique
):
raise MergeError(f"Data columns not unique: {repr(common_cols)}")
self.left_on = self.right_on = common_cols
elif self.on is not None:
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/indexes/base_class/test_setops.py
Original file line number Diff line number Diff line change
Expand Up @@ -141,7 +141,7 @@ def test_intersection_str_dates(self, sort):

@pytest.mark.parametrize(
"index2,expected_arr",
[(Index(["B", "D"]), ["B"]), (Index(["B", "D", "A"]), ["A", "B", "A"])],
[(Index(["B", "D"]), ["B"]), (Index(["B", "D", "A"]), ["A", "B"])],
)
def test_intersection_non_monotonic_non_unique(self, index2, expected_arr, sort):
# non-monotonic non-unique
Expand Down
23 changes: 23 additions & 0 deletions pandas/tests/indexes/multi/test_setops.py
Original file line number Diff line number Diff line change
Expand Up @@ -378,3 +378,26 @@ def test_setops_disallow_true(method):

with pytest.raises(ValueError, match="The 'sort' keyword only takes"):
getattr(idx1, method)(idx2, sort=True)


@pytest.mark.parametrize(
("tuples", "exp_tuples"),
[
([("val1", "test1")], [("val1", "test1")]),
([("val1", "test1"), ("val1", "test1")], [("val1", "test1")]),
(
[("val2", "test2"), ("val1", "test1")],
[("val2", "test2"), ("val1", "test1")],
),
],
)
def test_intersect_with_duplicates(tuples, exp_tuples):
# GH#36915
left = MultiIndex.from_tuples(tuples, names=["first", "second"])
right = MultiIndex.from_tuples(
[("val1", "test1"), ("val1", "test1"), ("val2", "test2")],
names=["first", "second"],
)
result = left.intersection(right)
expected = MultiIndex.from_tuples(exp_tuples, names=["first", "second"])
tm.assert_index_equal(result, expected)
10 changes: 10 additions & 0 deletions pandas/tests/indexes/test_setops.py
Original file line number Diff line number Diff line change
Expand Up @@ -120,6 +120,16 @@ def test_dunder_inplace_setops_deprecated(index):
index ^= index


@pytest.mark.parametrize("values", [[1, 2, 2, 3], [3, 3]])
def test_intersection_duplicates(values):
# GH#31326
a = pd.Index(values)
b = pd.Index([3, 3])
result = a.intersection(b)
expected = pd.Index([3])
tm.assert_index_equal(result, expected)


class TestSetOps:
# Set operation tests shared by all indexes in the `index` fixture
@pytest.mark.parametrize("case", [0.5, "xxx"])
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/reshape/merge/test_merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -753,7 +753,7 @@ def test_overlapping_columns_error_message(self):

# #2649, #10639
df2.columns = ["key1", "foo", "foo"]
msg = r"Data columns not unique: Index\(\['foo', 'foo'\], dtype='object'\)"
msg = r"Data columns not unique: Index\(\['foo'\], dtype='object'\)"
with pytest.raises(MergeError, match=msg):
merge(df, df2)

Expand Down