BUG: Fix duplicates in intersection of multiindexes #36927
```diff
@@ -3601,6 +3601,8 @@ def intersection(self, other, sort=False):
         other, result_names = self._convert_can_do_setop(other)

         if self.equals(other):
+            if self.has_duplicates:
+                return self.unique().rename(result_names)
             return self.rename(result_names)

         if not is_object_dtype(other.dtype):
```

**Reviewer:** @phofl it looks like we do something slightly different in many of the `intersection` implementations […]. The RangeIndex one is equivalent to the Index/IntervalIndex one because it never has duplicates (we can add that check to make the code match exactly). Can the others be made to have identical logic?

**@phofl:** Is there still an open case where I can be of help?

**Reviewer:** I think we're all set for now; will take another look after all the current Index.intersection PRs go through. Thanks.

**@phofl:** Great.

**Reviewer:** It looks like several of them are still either […], which I think is wrong if it has duplicates, or […], which isn't handling the has-duplicates case like the others. Can these all be identical? Also, if you really get on a roll, […].

**@phofl:** Will look through them and try to standardize them as much as possible.
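The effect of the `equals` fast path above can be illustrated with a small sketch (assuming a pandas version that includes this fix):

```python
import pandas as pd

# A MultiIndex containing a duplicated tuple.
mi = pd.MultiIndex.from_tuples([(1, "a"), (1, "a"), (2, "b")])

# With the added has_duplicates branch, intersecting an index with an
# equal index deduplicates the result instead of returning it unchanged.
result = mi.intersection(mi)
assert not result.has_duplicates
assert list(result) == [(1, "a"), (2, "b")]
```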
```diff
@@ -3619,10 +3621,12 @@ def intersection(self, other, sort=False):
         uniq_tuples = None  # flag whether _inner_indexer was successful
         if self.is_monotonic and other.is_monotonic:
             try:
-                uniq_tuples = self._inner_indexer(lvals, rvals)[0]
-                sort = False  # uniq_tuples is already sorted
+                inner_tuples = self._inner_indexer(lvals, rvals)[0]
+                sort = False  # inner_tuples is already sorted
             except TypeError:
                 pass
+            else:
+                uniq_tuples = algos.unique(inner_tuples)

         if uniq_tuples is None:
             other_uniq = set(rvals)
```
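On the monotonic fast path, `_inner_indexer` can emit repeated tuples when the inputs contain duplicates; the added `else` branch runs its output through `algos.unique`. A sketch of the observable behavior (assuming a pandas version with this fix):

```python
import pandas as pd

# Both inputs are monotonic and contain duplicate tuples, so the
# monotonic fast path is taken and its inner-join result is deduplicated.
left = pd.MultiIndex.from_tuples([(1, "a"), (1, "a"), (2, "b"), (3, "c")])
right = pd.MultiIndex.from_tuples([(1, "a"), (2, "b"), (2, "b")])

result = left.intersection(right)
assert not result.has_duplicates
assert list(result) == [(1, "a"), (2, "b")]
```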
**Reviewer:** Can you add an assert that L2867 is always unique? E.g., we are guaranteed uniqueness here at L2862 (and I think `safe_sort` guarantees this), but let's be explicit (and add a comment).

**Reviewer:** You should check `is_unique` first. It will be computed, but that is likely faster than always doing this (which copies yet again).

**@phofl:** We may have a numpy array here, so we cannot do `result.is_unique`. Is there a better way than using `algos.unique` on the result and comparing shapes?

**Reviewer:** Sure, you can do […] (measure perf).

**@phofl:** Sorry, did not get that. Did it now; it's in #38154.
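The trade-off discussed above can be sketched as follows; the variable names are illustrative, not taken from the PR:

```python
import numpy as np
import pandas as pd

values = np.array([3, 1, 2, 1])

# Approach from the thread: materialize the unique values, then compare
# shapes to detect duplicates. This always pays for the deduplication.
deduped = pd.unique(values)
assert len(deduped) != len(values)  # duplicates present

# Alternative sketch: wrap the array in an Index and ask for is_unique,
# which answers the question via a hashtable pass and caches the result.
assert not pd.Index(values).is_unique
```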