Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix union_indexes not supporting sort=False for Index subclasses #35098

Merged
merged 44 commits into from
Jul 9, 2020

Conversation

AlexKirko
Copy link
Member

@AlexKirko AlexKirko commented Jul 2, 2020

Scope of PR

The reason for DataFrame.append unnecessarily sorting columns turned out to be that when merging indexes we were not passing the sort argument for an Index subclass. This PR fixes this bug.

Details

At some point down the call stack, DataFrame.append calls union_indexes in core.indexes.api. In it we detect the type of our indexes:

indexes, kind = _sanitize_and_check(indexes)

We don't pass sort for kind == 'special' in union_indexes.

I had some other ideas, but what I ended up doing is to simply pass the sort argument to Index.union for kind == 'special'. We do have to ignore the sort argument for [MultiIndex, RangeIndex, DatetimeIndex, CategoricalIndex] to avoid breaking tests and backward compatibility, but it doesn't make sense for these ones to not be sorted when joined anyway, in my opinion.

Performance

There are no changes to algorithms called. I still did some benchmarking to compare with my previous approach. The current solution didn't have any negative performance impact.

@AlexKirko

This comment has been minimized.

pandas/core/indexes/api.py Outdated Show resolved Hide resolved
@AlexKirko
Copy link
Member Author

Researching. Fiddling with pd.concat and union_indexes affects a bunch of stuff. Let's see if I can get this bugfix to work without introducing undesired behavior elsewhere.

@AlexKirko AlexKirko marked this pull request as draft July 3, 2020 06:51
@AlexKirko
Copy link
Member Author

AlexKirko commented Jul 3, 2020

The current idea is to support sort only for Index subclasses that are just an Index of a specific datatype.

Update: okay, seems to work. I'll see if I can benchmark tomorrow, and it should be ready for a review.
Update 2: benchmarked it, there were some decreases, so I did some more digging and ended up simply passing sort to Index.union instead of all the Index subclass detection.

@gfyoung gfyoung added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jul 3, 2020
If an Index isn't just a sliceable one-level array (Index), we should
not mess with sort to avoid breaking backwards compatibility.
):
ignore_sort = True
else:
ignore_sort = False
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we let any of these index type not get sorted, we'll break some tests where we assume that the results of joining on any of these come out sorted. Moreover, there isn't much point to having any of them unsorted except for pretty printing purposes, in my opinion.
If you think we should pass sort through for every type of Index subclass, please let me know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you show what tests break?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would rather just pass thru and adjust the tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@AlexKirko AlexKirko marked this pull request as ready for review July 5, 2020 06:40
@AlexKirko
Copy link
Member Author

@gfyoung This PR is ready for a review now.

pandas/core/indexes/api.py Outdated Show resolved Hide resolved
pandas/core/indexes/api.py Outdated Show resolved Hide resolved
@pep8speaks
Copy link

pep8speaks commented Jul 7, 2020

Hello @AlexKirko! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-07-08 16:02:22 UTC

@AlexKirko
Copy link
Member Author

AlexKirko commented Jul 7, 2020

@jreback Changes made, ready for another review:

  1. Pass sort through to all Index subclasses. Alter tests to accomodate no automatic sorting. Add comments to changes where appropriate. After another look, I don't think this change will break anything for our users.
  2. Add test for append to reshape/test_concat.py. Use streamlined equivalent of the OP's test case, as the original case is overly verbose.
  3. Move whatsnew record to reshape section and remove mention of the private function.

@AlexKirko AlexKirko requested a review from jreback July 7, 2020 09:56
pandas/core/indexes/api.py Show resolved Hide resolved
@AlexKirko
Copy link
Member Author

All green.

@AlexKirko AlexKirko requested a review from jreback July 8, 2020 05:27
sort=None signifies that Index.union doesn't gurarantee sorting
or error by default. It might mirror legacy behavior and leave
the Index unsorted in edge cases.
@jreback jreback added this to the 1.1 milestone Jul 8, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm ping on green.

@AlexKirko
Copy link
Member Author

@jreback
All green. There was some hardware incident that impacted Azure pipelines, but we are good now.

@AlexKirko AlexKirko requested a review from jreback July 9, 2020 09:44
@jreback jreback merged commit c21be05 into pandas-dev:master Jul 9, 2020
@jreback
Copy link
Contributor

jreback commented Jul 9, 2020

thanks very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.append seems to be sorting columns even when sort=False
4 participants