Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add dropna in groupby to allow NaN in keys #30584

Merged
merged 59 commits into from
May 9, 2020

Conversation

charlesdong1991
Copy link
Member

@charlesdong1991 charlesdong1991 commented Dec 31, 2019

Note that this PR will NOT fix the issue for pivot_table for now, the reason is that there is already an argument called dropna in pivot_table and it has slightly different meaning, currently it means: Do not include columns whose entries are all NaN.

I would propose a change in the follow-up PR for this since this is an api change: change the name of current dropna to drop_all_na maybe? and then add dropna to it and it is aligned with the dropna in groupby.

Summary:
After this PR, it will optional to inlcude NaN in group keys, e.g. below, and i also add example in docstring as well:

a = [['a', 'b', 12, 12, 12], ['a', None, 12.3, 233., 12], ['b', 'a', 123.23, 123, 1], ['a', 'b', 1, 1, 1.]]
df = pd.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e'])

df.groupby(by=['a', 'b']).sum()

will get
Screen Shot 2020-01-01 at 11 01 16 AM

with dropna=False,

df.groupby(by=['a', 'b'], dropna=False).sum()

Screen Shot 2020-01-01 at 11 01 23 AM

For Series, it is the same:

s = pd.Series([1, 2, 3, 3], index=["a", 'a', 'b', np.nan])

s.groupby(level=0).sum()
s.groupby(level=0, dropna=False).sum()

Screen Shot 2020-01-01 at 11 01 32 AM

@charlesdong1991 charlesdong1991 changed the title ENH: Add dropna in groupby to allow NaN in keys [WIP] ENH: Add dropna in groupby to allow NaN in keys Dec 31, 2019
@charlesdong1991 charlesdong1991 changed the title [WIP] ENH: Add dropna in groupby to allow NaN in keys ENH: Add dropna in groupby to allow NaN in keys Dec 31, 2019
@jreback jreback added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jan 1, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you show some examples of using this (in the top of the PR). (ultimately these would need to become doc-examples).

@@ -549,6 +549,7 @@ Other API changes
Supplying anything else than ``how`` to ``**kwargs`` raised a ``TypeError`` previously (:issue:`29388`)
- When testing pandas, the new minimum required version of pytest is 5.0.1 (:issue:`29664`)
- :meth:`Series.str.__iter__` was deprecated and will be removed in future releases (:issue:`28277`).
- :meth:`DataFrame.groupby` and :meth:`Series.groupby` have gained ``dropna`` argument in order to allow ``NaN`` values in group keys (:issue:`3729`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would need a subsection, this is a major new feature

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And i also updated on top in description.

sort: bool = False,
na_sentinel: int = -1,
size_hint: Optional[int] = None,
dropna: Optional[bool] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be just bool?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to bool

@@ -630,6 +634,9 @@ def factorize(
uniques, codes = safe_sort(
uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
)
if dropna is False and (codes == na_sentinel).any():
uniques = np.append(uniques, [np.nan])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally we push this down to cython, but ok here

@@ -630,6 +634,9 @@ def factorize(
uniques, codes = safe_sort(
uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
)
if dropna is False and (codes == na_sentinel).any():
uniques = np.append(uniques, [np.nan])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, this should be a dtype appropriate for the dtype

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think dtype is defined below in _construct_data? also, if added, for categorical values, will get ValueError: Categorial categories cannot be null error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right so instead of adding a null to the categories like you are doing, you just add the appropriate -1 entries in the codes which automatically handle the nullness

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emm, this has to be a None or NA value in there. otherwise, output of uniques does not have na value, and also, my python will crash with FatalError somehow

pandas/core/generic.py Outdated Show resolved Hide resolved
@@ -2025,3 +2025,146 @@ def test_groupby_crash_on_nunique(axis):
expected = expected.T

tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make a new file, test_groupby_dropna

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, moved to test_groupby_dropna

),
],
)
def test_groupby_dropna_multi_index_dataframe_agg(dropna, tuples, outputs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an example that uses NaT (datetime & timedelta)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added! see in test_groupby_dropna

["A", None, 12.3, 233.0, 12],
["B", "A", 123.23, 123, 1],
["A", "B", 1, 1, 1.0],
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we have NaN in 2 groups? does this work (e.g. different 1st level, but NaN) in 2nd. Also how is nan in first level?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a lot more different scenarios!

We've added a ``dropna`` keyword to :meth:`DataFrame.groupby` and :meth:`Series.groupby` in order to
allow ``NaN`` values in group keys. Users can define ``dropna`` to ``False`` if they want to include
``NaN`` values in groupby keys. The default is set to ``True`` for ``dropna`` to keep backwards
compatibility (:issue:`3729`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add an example (like you have in the doc-strings)

also add examples in groupby.rst (and point from here)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in both!

@@ -630,6 +634,9 @@ def factorize(
uniques, codes = safe_sort(
uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
)
if dropna is False and (codes == na_sentinel).any():
uniques = np.append(uniques, [np.nan])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right so instead of adding a null to the categories like you are doing, you just add the appropriate -1 entries in the codes which automatically handle the nullness

@@ -5648,6 +5648,41 @@ def update(
Type
Captive 210.0
Wild 185.0

We can also choose to include NaN in group keys or not by defining
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defining -> setting

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rephrased

@@ -7346,6 +7346,12 @@ def clip(
If False: show all values for categorical groupers.

.. versionadded:: 0.23.0
dropna : bool, default True
If True, and if group keys contain NaN values, NaN values together
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't say NaN, say NA values (e.g. can also be NaT or the new NA scalar)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed!

values,
sort: bool = False,
na_sentinel: int = -1,
size_hint: Optional[int] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some independent tests for factorize with dropna

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emm, I think i added two test cases already in test_algos. Is it what you mean? or you want to have different test cases?

@charlesdong1991
Copy link
Member Author

ping @jreback while waiting for other reviews

p.s. somehow, there are less-than-usual CI checks after a rebase (i expect at least 13 or so, but see only 6, seems a lot of checks on other OS or python version are not triggered), not sure how/why, and pls let me know if another rebase is needed to see if the change is green on all.

@jreback
Copy link
Contributor

jreback commented Apr 27, 2020

ping @jreback while waiting for other reviews

p.s. somehow, there are less-than-usual CI checks after a rebase (i expect at least 13 or so, but see only 6, seems a lot of checks on other OS or python version are not triggered), not sure how/why, and pls let me know if another rebase is needed to see if the change is green on all.

hmm, try rebasing again, this happened once before then resolved itself.

@charlesdong1991
Copy link
Member Author

emm, weird, indeed seems it resolves itself. @jreback

@charlesdong1991
Copy link
Member Author

ping for reviews ^^

@jreback @jbrockmendel @TomAugspurger @jorisvandenbossche

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 28, 2020 via email

@charlesdong1991
Copy link
Member Author

maybe @WillAyd @jbrockmendel could take a look for some feedbacks?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a nice change. @jreback

@jreback
Copy link
Contributor

jreback commented May 6, 2020

@charlesdong1991 can you merge master and ping on green.

@charlesdong1991
Copy link
Member Author

ping @jreback @WillAyd

@jreback jreback merged commit 88d5f12 into pandas-dev:master May 9, 2020
@jreback
Copy link
Contributor

jreback commented May 9, 2020

thanks @charlesdong1991 very nice! this has been a long time requested feature!

@senegrom
Copy link

stupid question: how does that feature propagate into the standard pandas branches? any timeframe?

@jsignell
Copy link
Contributor

My understanding is this is slated for inclusion in pandas 1.1.0, but I'm not sure when that is planned for.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: pivot/groupby index with nan
7 participants