Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Maintain column order with groupby.nth #22811

Merged
merged 4 commits into from
Nov 20, 2018

Conversation

reidy-p
Copy link
Contributor

@reidy-p reidy-p commented Sep 23, 2018

@pep8speaks
Copy link

pep8speaks commented Sep 23, 2018

Hello @reidy-p! Thanks for updating the PR.

Comment last updated on October 20, 2018 at 22:24 Hours UTC

@@ -497,7 +497,8 @@ def _set_group_selection(self):

if len(groupers):
# GH12839 clear selected obj cache when group selection changes
self._group_selection = ax.difference(Index(groupers)).tolist()
self._group_selection = ax.difference(Index(groupers),
sort=False).tolist()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Index.difference tries to sort its result by default and this means that sometimes the order of the columns was changed from the original DataFrame. I added a new sort parameter to Index.difference with a default of True to control this.

@codecov
Copy link

codecov bot commented Sep 23, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@960a73f). Click here to learn what that means.
The diff coverage is 81.81%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #22811   +/-   ##
=========================================
  Coverage          ?    92.2%           
=========================================
  Files             ?      169           
  Lines             ?    50927           
  Branches          ?        0           
=========================================
  Hits              ?    46955           
  Misses            ?     3972           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.62% <81.81%> (?)
#single 42.3% <45.45%> (?)
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 95.46% <100%> (ø)
pandas/core/groupby/groupby.py 96.47% <100%> (ø)
pandas/core/indexes/base.py 96.55% <66.66%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 960a73f...13a23f7. Read the comment docs.

@jreback jreback added the Bug label Sep 23, 2018
@jreback jreback added this to the 0.24.0 milestone Sep 23, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. there are a couple of other PRs out there which add a sort=True default to the set ops, though I think , though might be for intersection IIRC. can you see if you can locate / reference.

sort : bool, default True
Sort the resulting index if possible

.. versionadded:: 0.24.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make sure this is added to all subclasses as well (mutli, interval) I think have there own impl.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do this (in this PR), can ideally update the tests for .difference for all types to parameterize it where appropriate

try:
the_diff = sorting.safe_sort(the_diff)
except TypeError:
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some tests in the index tests to exercise this (prob just parameterize the parameter in the tests)

@jreback
Copy link
Contributor

jreback commented Sep 25, 2018

the other issue is #17378

@reidy-p
Copy link
Contributor Author

reidy-p commented Sep 26, 2018

#20809 and #17878 are also related. I'm working on trying to expand the pull request to the other set ops but it will take a few more days at least.

@jreback
Copy link
Contributor

jreback commented Sep 27, 2018

just wanted to make you aware of the other issues

not super necessary to actually do it in this PR unless it’s straightforwars

@mroeschke
Copy link
Member

Mostly cross referencing for myself. #21603 should become a lot easier when this is completed; first and last could not be written strictly in terms of nth due to the column ordering.

@jreback
Copy link
Contributor

jreback commented Oct 1, 2018

lgtm. can you rebase, ping on green.

@reidy-p
Copy link
Contributor Author

reidy-p commented Oct 1, 2018

@jreback I've rebased now but do you want me to add sort=True to the other set operations in this PR too (or just the subclasses of Index) as you suggested above? I've started it but it needs a bit more work. I can put it in a new PR if it's easier.

@mroeschke I had the exact same thought when I was working on a PR for first and last which is why I opened this PR first!

@jreback
Copy link
Contributor

jreback commented Oct 7, 2018

PR too (or just the subclasses of Index) as you suggested above? I've started it but it needs a bit more work. I can put it in a new PR if it's easier.

yes new PR after this is merged.

sort : bool, default True
Sort the resulting index if possible

.. versionadded:: 0.24.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do this (in this PR), can ideally update the tests for .difference for all types to parameterize it where appropriate

@reidy-p reidy-p force-pushed the nth_column_order branch 3 times, most recently from 8f881c8 to cdc638b Compare October 11, 2018 21:49
@reidy-p
Copy link
Contributor Author

reidy-p commented Oct 11, 2018

I have made some updates but I just realised that I need to add the new sort parameter to the tests for the subclasses of Index

@reidy-p reidy-p force-pushed the nth_column_order branch 3 times, most recently from 922f1eb to f7446b5 Compare October 20, 2018 22:23
@@ -1040,7 +1040,11 @@ def func(self, other):
'objects that have compatible dtypes')
raise TypeError(msg.format(op=op_name))

result = getattr(self._multiindex, op_name)(other._multiindex)
if op_name == 'difference':
result = getattr(self._multiindex, op_name)(other._multiindex,
Copy link
Contributor Author

@reidy-p reidy-p Oct 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit awkward at the moment because difference is the only set operation with the sort parameter. But if we add a sort parameter to the other set operations I think we can get rid of the if statement

@@ -2791,8 +2791,14 @@ def difference(self, other, sort=True):
labels=[[]] * self.nlevels,
names=result_names, verify_integrity=False)

difference = set(self._ndarray_values) - set(other._ndarray_values)
this = self._get_unique_index()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old way of doing this using set did not preserve the original order so I took this code from the difference method in pandas/core/indexes/base.py:

this = self._get_unique_index()
indexer = this.get_indexer(other)
indexer = indexer.take((indexer != -1).nonzero()[0])
label_diff = np.setdiff1d(np.arange(this.size), indexer,
assume_unique=True)
the_diff = this.values.take(label_diff)

rng1 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz)
@pytest.mark.parametrize("sort", [True, False])
def test_difference(self, tz, sort):
rng_dates = ['1/2/2000', '1/3/2000', '1/1/2000', '1/4/2000',
Copy link
Contributor Author

@reidy-p reidy-p Oct 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to ensure that the sort parameter was getting a proper test with unsorted data so I have rewritten some tests to have unsorted data (e.g., by manually specifying a list of dates here rather than using date_range). I have made similar changes to other existing tests.

@reidy-p reidy-p force-pushed the nth_column_order branch 2 times, most recently from bf9c4f9 to d4ec2d9 Compare October 28, 2018 16:00
@jreback
Copy link
Contributor

jreback commented Nov 1, 2018

can you rebase and fixup

@jreback
Copy link
Contributor

jreback commented Nov 18, 2018

rebased. though will look again.

@jreback jreback merged commit 71ba5bf into pandas-dev:master Nov 20, 2018
@jreback
Copy link
Contributor

jreback commented Nov 20, 2018

thanks @reidy-p nice job!

thoo added a commit to thoo/pandas that referenced this pull request Nov 20, 2018
…fixed

* upstream/master:
  DOC: more consistent flake8-commands in contributing.rst (pandas-dev#23724)
  DOC: Fixed the doctsring for _set_axis_name (GH 22895) (pandas-dev#22969)
  DOC: Improve GL03 message re: blank lines at end of docstrings. (pandas-dev#23649)
  TST: add tests for keeping dtype in Series.update (pandas-dev#23604)
  TST: For GH4861, Period and datetime in multiindex (pandas-dev#23776)
  TST: move .str-test to strings.py & parametrize it; precursor to pandas-dev#23582 (pandas-dev#23777)
  STY: isort tests/scalar, tests/tslibs, import libwindow instead of _window (pandas-dev#23787)
  BUG: fixed .str.contains(..., na=False) for categorical series (pandas-dev#22170)
  BUG: Maintain column order with groupby.nth (pandas-dev#22811)
  API/DEPR: replace kwarg "pat" with "sep" in str.[r]partition (pandas-dev#23767)
  CLN: Finish isort core (pandas-dev#23765)
  TST: Mark test_pct_max_many_rows with pytest.mark.single (pandas-dev#23799)
@reidy-p reidy-p deleted the nth_column_order branch December 8, 2018 13:05
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nth() mixes column order
4 participants