Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add sort parameter to set operations for some Indexes and adjust… #24521

Merged
merged 7 commits into from
Jan 19, 2019

Conversation

reidy-p
Copy link
Contributor

@reidy-p reidy-p commented Dec 31, 2018

… tests

This PR makes some progress towards adding a sort parameter with a default of True to the set operations (union, intersection, difference and symmetric_difference) on Index classes. Adding this parameter into every type of Index would result in a very big PR so I have decided to break it up and try to add it in stages. I have tried to focus on Index, MultiIndex, PeriodIndex and Intervalndex in this PR but have made some very small changes to set operations on other indices where necessary.

Some issues to consider:

  • I'm not sure whether it will be possible to control the sorting behaviour of the results of all of the set operations for all of the Index types. For example, because of the way some of the set operations are implemented on some Index types the results may always be sorted even if sort=False is passed (e.g., a union operation on DatetimeIndexs may always return a sorted result in some cases even if sort=False). Perhaps this is not really a problem as long as we document it.
  • There are some other corner cases that always ignore the sort parameter at the moment. For example, the intersection of an unsorted Index with itself will return the original unsorted Index even if sort=True is passed because the sort parameter is simply ignored in this case. Similarly, the intersection of an unsorted Index with an empty Index will also return the original unsorted Index even if sort=True. There is similar behaviour for the other set operations.

@@ -603,7 +603,7 @@ def _wrap_setop_result(self, other, result):
raise ValueError('Passed item and index have different timezone')
return self._shallow_copy(result, name=name, freq=None, tz=self.tz)

def intersection(self, other):
def intersection(self, other, sort=True):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sort parameter is not fully implemented for intersection on DatetimeIndex yet. I just put this in for now to fix some failing tests

@codecov
Copy link

codecov bot commented Dec 31, 2018

Codecov Report

Merging #24521 into master will increase coverage by 0.01%.
The diff coverage is 23.25%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24521      +/-   ##
==========================================
+ Coverage   31.88%    31.9%   +0.01%     
==========================================
  Files         166      166              
  Lines       52434    52444      +10     
==========================================
+ Hits        16718    16730      +12     
+ Misses      35716    35714       -2
Flag Coverage Δ
#multiple 30.28% <16.27%> (-0.01%) ⬇️
#single 31.9% <23.25%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/interval.py 35.03% <0%> (+0.12%) ⬆️
pandas/core/indexes/api.py 78.21% <0%> (ø) ⬆️
pandas/core/indexes/base.py 47.46% <20.68%> (+0.44%) ⬆️
pandas/core/indexes/range.py 48.97% <25%> (-0.29%) ⬇️
pandas/core/indexes/datetimes.py 37.33% <33.33%> (ø) ⬆️
pandas/core/indexes/multi.py 30.58% <40%> (-0.05%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bf9d41c...4b31df2. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 31, 2018

Codecov Report

Merging #24521 into master will decrease coverage by <.01%.
The diff coverage is 94.59%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24521      +/-   ##
==========================================
- Coverage   92.38%   92.38%   -0.01%     
==========================================
  Files         166      166              
  Lines       52382    52392      +10     
==========================================
+ Hits        48395    48404       +9     
- Misses       3987     3988       +1
Flag Coverage Δ
#multiple 90.81% <91.89%> (ø) ⬆️
#single 42.91% <45.94%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 95.59% <100%> (ø) ⬆️
pandas/core/indexes/datetimes.py 96.27% <100%> (ø) ⬆️
pandas/core/indexes/api.py 99% <100%> (ø) ⬆️
pandas/core/indexes/interval.py 95.25% <100%> (-0.02%) ⬇️
pandas/io/pytables.py 92.31% <100%> (ø) ⬆️
pandas/core/indexes/range.py 97.36% <100%> (+0.01%) ⬆️
pandas/core/indexes/base.py 96.32% <90.9%> (+0.01%) ⬆️
pandas/util/testing.py 88.04% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aa05e54...7b15248. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good.

pandas/_libs/lib.pyx Show resolved Hide resolved
try:
taken = taken.sort_values()
except TypeError as e:
warnings.warn("%s, sort order is undefined for "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this actually get hit? I really hate all of these sort warnings. maybe can you factor them out in a helper function instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use .format here

pandas/core/indexes/base.py Show resolved Hide resolved
pandas/tests/indexes/multi/test_set_ops.py Show resolved Hide resolved
@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Dec 31, 2018
@jreback
Copy link
Contributor

jreback commented Dec 31, 2018

so we can followup up on the 2 cases you mentioned at the top of the PR. pls roll them into an issue (you can expand with check boxes if more clear).

elif how == 'outer':
join_index = self.union(other, sort=sort)
join_index = self.union(other)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Master union does generally sort by default so I have left the default sort=True here for now for compatibility reasons so the current behaviour of join(..., how='outer') does not change for now and all of the tests expecting the results of this type of join to be sorted don't break.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a TODO

@reidy-p
Copy link
Contributor Author

reidy-p commented Jan 12, 2019

Updated the original issue (#24471) with some of the corner cases I mention above. I will try to address these in subsequent PRs once this PR is finished.

@@ -4473,7 +4473,7 @@ def _reindex_axis(obj, axis, labels, other=None):

labels = ensure_index(labels.unique())
if other is not None:
labels = ensure_index(other.unique()) & labels
labels = ensure_index(other.unique()).intersection(labels, sort=False)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, on Master intersection does not generally sort by default so I have added sort=False here for compatibility

@@ -2359,12 +2359,14 @@ def test_intersection_base(self, sort):
if PY3 and sort:
# unorderable types
warn_type = RuntimeWarning
expected = Index([0, 'a', 1])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment Index.intersection does not use sorting.safe_sort which tries to sort arrays of mixed type (int and str) by ordering int before str in Python 3. I could try to change this either in this PR or in a separate one if this is getting too big already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to use sort_safe in Index.intersection in this PR in the next commit

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments. ping on green.

pass
if sort:
try:
uniques.sort()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might see if using safe_sort makes sense here (you would have to import here, otherwise this would e circular)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using safe_sort here but it was causing some problems. The issue seems to be that uniques here is a list of tuples which are then used to construct a MultiIndex in MultiIndex.union. However, when we use safe_sort it turns uniques into an np.array and doesn't sort it correctly so we get the wrong results. I can have a closer look at trying to resolve this if you want but it might involve changing safe_sort a bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let's followup up later then (new PR).

if sort:
try:
taken = sorting.safe_sort(taken.values)
if self.name != other.name:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use
_get_reconciled_name_object

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using get_reconciled_name_object but from the docstring it says that it's used when the "result of the set operation will be self, return self, unless the name changes, in which case make a shallow copy of self". But the result of the set operation is not self in this case, I think.

try:
taken = taken.sort_values()
except TypeError as e:
warnings.warn("%s, sort order is undefined for "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

pandas/core/indexes/base.py Show resolved Hide resolved
elif how == 'outer':
join_index = self.union(other, sort=sort)
join_index = self.union(other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a TODO

@jreback jreback added this to the 0.24.0 milestone Jan 13, 2019
pass
if sort:
try:
uniques.sort()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

result = sorting.safe_sort(result)
except TypeError as e:
warnings.warn("%s, sort order is undefined for "
"incomparable objects" % e, RuntimeWarning,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use .format() here

try:
taken = taken.sort_values()
except TypeError as e:
warnings.warn("%s, sort order is undefined for "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use .format here

@reidy-p reidy-p force-pushed the sort_parameter_set_operations branch from d9e68be to 9549661 Compare January 19, 2019 18:20
@pep8speaks
Copy link

pep8speaks commented Jan 19, 2019

Hello @reidy-p! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 19, 2019 at 18:22 Hours UTC

@reidy-p reidy-p force-pushed the sort_parameter_set_operations branch from 9549661 to 7b15248 Compare January 19, 2019 18:22
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of comments, but let's do in a followup if you can (otherwise pls make an issue).

pass
if sort:
try:
uniques.sort()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let's followup up later then (new PR).

@@ -2385,8 +2393,18 @@ def intersection(self, other):
indexer = indexer[indexer != -1]

taken = other.take(indexer)

if sort:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I still think we can use _get_reconcile_name_object here (maybe need to wrap taken in Index), but can do in a followup.

@jreback jreback merged commit bd3c001 into pandas-dev:master Jan 19, 2019
@jreback
Copy link
Contributor

jreback commented Jan 19, 2019

thanks @reidy-p

@reidy-p
Copy link
Contributor Author

reidy-p commented Jan 19, 2019

@jreback thanks. I hope to do a few follow-up PRs on this.

@reidy-p reidy-p deleted the sort_parameter_set_operations branch January 26, 2019 18:11
@shoyer
Copy link
Member

shoyer commented Jan 27, 2019

Was it intended to change the default behavior of Index.intersection here? Previously intersection never sorted, but now it sorts by default.

@jreback
Copy link
Contributor

jreback commented Jan 27, 2019

all of the set ops always sorted; the parameter now allows u to turn it off (default should be the same as before)

@shoyer
Copy link
Member

shoyer commented Jan 27, 2019

This is definitely changed behavior from 0.23.x for intersection (it turned up in a failure in xarray's test suite), but I'll raise another issue about that.

@shoyer
Copy link
Member

shoyer commented Jan 27, 2019

See #24959 for discussion about the behavior change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants