-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix duplicates in intersection of multiindexes #36927
Changes from 23 commits
cdefaae
fbd63f2
53a37d1
5675a4e
134936c
582c0b9
7805de5
67691df
8fb0055
66b519f
cb1477b
0fb2561
3c19d57
10524fd
3dde0ee
a0a1a33
45dfb84
d71a499
d873d5a
e90239a
c2b448a
742716e
321797a
a980ec0
972fd48
fe1ded4
8e4d47b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2822,7 +2822,7 @@ def intersection(self, other, sort=False): | |
self._assert_can_do_setop(other) | ||
other = ensure_index(other) | ||
|
||
if self.equals(other): | ||
if self.equals(other) and not self.has_duplicates: | ||
return self._get_reconciled_name_object(other) | ||
|
||
if not is_dtype_equal(self.dtype, other.dtype): | ||
|
@@ -2847,6 +2847,7 @@ def _intersection(self, other, sort=False): | |
except TypeError: | ||
pass | ||
else: | ||
result = algos.unique1d(result) | ||
phofl marked this conversation as resolved.
Show resolved
Hide resolved
|
||
return result | ||
|
||
try: | ||
|
@@ -2858,7 +2859,7 @@ def _intersection(self, other, sort=False): | |
indexer = algos.unique1d(Index(rvals).get_indexer_non_unique(lvals)[0]) | ||
indexer = indexer[indexer != -1] | ||
|
||
result = other.take(indexer)._values | ||
result = other.take(indexer).unique()._values | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you add an assert that L2867 is always unique (e.g. we are guaranteed uniqu here L2862 (and I think safe_sort guaranteess this), but let's be explicit (and add a comment). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you should check if this is_unique first. it will be computed but is likely faster than always doing this (and copies yet again). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We may have a numpy array here, so we can not do result.is_unique. Is there a better way than using algos.unique on result and comparing shapes then? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sure you can do (measure perf)
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, did not get that. Did it now, is in #38154 |
||
|
||
if sort is None: | ||
result = algos.safe_sort(result) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3601,6 +3601,8 @@ def intersection(self, other, sort=False): | |
other, result_names = self._convert_can_do_setop(other) | ||
|
||
if self.equals(other): | ||
if self.has_duplicates: | ||
return self.unique().rename(result_names) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @phofl it looks like we do something slightly different in many of the
The RangeIndex one is equivalent to the Index/Intervalindex bc it never has duplicates (can add that check to make the code match exactly). Can the others be made to have identical logic? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there still an open case where I can be of help? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we're all set for now, will take another look after all the current Index.intersection PRs go through. thanks There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Great There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It looks like several of them are still either
which i think is wrong if it has duplicates, or
which isnt handling the has-duplicates case like the others. can these all be identical? Also if you really get on a roll, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Will look through them and try to standardize them as much as possible |
||
return self.rename(result_names) | ||
|
||
if not is_object_dtype(other.dtype): | ||
|
@@ -3619,10 +3621,12 @@ def intersection(self, other, sort=False): | |
uniq_tuples = None # flag whether _inner_indexer was successful | ||
if self.is_monotonic and other.is_monotonic: | ||
try: | ||
uniq_tuples = self._inner_indexer(lvals, rvals)[0] | ||
sort = False # uniq_tuples is already sorted | ||
inner_tuples = self._inner_indexer(lvals, rvals)[0] | ||
sort = False # inner_tuples is already sorted | ||
except TypeError: | ||
pass | ||
else: | ||
uniq_tuples = algos.unique(inner_tuples) | ||
|
||
if uniq_tuples is None: | ||
other_uniq = set(rvals) | ||
|
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -311,7 +311,9 @@ def should_reindex_frame_op( | |||
# TODO: any other cases we should handle here? | ||||
cols = left.columns.intersection(right.columns) | ||||
|
||||
if len(cols) and not (cols.equals(left.columns) and cols.equals(right.columns)): | ||||
if len(cols) and not ( | ||||
cols.equals(left.columns.unique()) and cols.equals(right.columns.unique()) | ||||
): | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we have a test that fails without this change? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, but the builds are gone. I will run the tests in the evening to find the relevant tests. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This one is crashing in line 255 because of memory issues. The condition is never fullfilled, so it blows up There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the condition in master is never fulfilled or the condition in the PR? btw, usually let the person who asked the question hit the "resolve conversation" button There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, did not know that. Will do that in the future. The input There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is this still needed? we now guaranteee that intersection returns unique columns, right so this should no longer be the case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, this is the problem. Is intersection is unique, cols.equals(left.columns) won't be True. This leads to a recursion in the test mentioned which blows up the memory, because we can not exit this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok pls add some comments to this effect then. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i would prob calculate left_uniques and right_uniques and make them variables as a bit simpler to read There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thx, changed it and added a comment |
||||
# TODO: is there a shortcut available when len(cols) == 0? | ||||
return True | ||||
|
||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to remove this, if we are backporting there should be nothing added to doc/source/whatsnew/v1.2.0.rst (doesn't exist on 1.1.x branch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, probably messed up the merge. Is gone now