ENH: Add dropna in groupby to allow NaN in keys #30584

charlesdong1991 · 2019-12-31T16:07:32Z

closes ENH: pivot/groupby index with nan #3729
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Note that this PR will NOT fix the issue for pivot_table for now. The reason is that pivot_table already has an argument called dropna with a slightly different meaning; currently it means: Do not include columns whose entries are all NaN.

I would propose handling this in a follow-up PR, since it is an API change: maybe rename the current dropna to drop_all_na, and then add a new dropna that is aligned with the dropna in groupby.

Summary:
After this PR, it will be optional to include NaN in group keys, e.g. as below; I also added an example to the docstring:

a = [['a', 'b', 12, 12, 12], ['a', None, 12.3, 233., 12], ['b', 'a', 123.23, 123, 1], ['a', 'b', 1, 1, 1.]]
df = pd.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e'])

df.groupby(by=['a', 'b']).sum()

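will exclude the row whose key in column b is missing (the default, dropna=True).

With dropna=False, that row forms its own group: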

df.groupby(by=['a', 'b'], dropna=False).sum()
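For reference, the comparison above can be run end to end as a small script. This is a minimal sketch, assuming a pandas version that already ships this change (1.1 or later); the names `default` and `with_na` are only illustrative.

```python
import pandas as pd

a = [
    ["a", "b", 12, 12, 12],
    ["a", None, 12.3, 233.0, 12],
    ["b", "a", 123.23, 123, 1],
    ["a", "b", 1, 1, 1.0],
]
df = pd.DataFrame(a, columns=["a", "b", "c", "d", "e"])

# Default behaviour: rows whose group key contains a missing value are dropped.
default = df.groupby(by=["a", "b"]).sum()

# dropna=False keeps the ("a", NaN) key as a group of its own.
with_na = df.groupby(by=["a", "b"], dropna=False).sum()

print(default)
print(with_na)
```

The second result has one more row than the first, and its index contains a missing value in level b.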

For Series, it is the same:

s = pd.Series([1, 2, 3, 3], index=["a", 'a', 'b', np.nan])

s.groupby(level=0).sum()
s.groupby(level=0, dropna=False).sum()
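The Series case can be checked the same way. A minimal sketch, again assuming pandas 1.1 or later; `dropped` and `kept` are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 3], index=["a", "a", "b", np.nan])

# Default: the entry whose index label is NaN is left out of every group.
dropped = s.groupby(level=0).sum()

# dropna=False: the NaN label becomes a group key.
kept = s.groupby(level=0, dropna=False).sum()

print(dropped)
print(kept)
```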

jreback

can you show some examples of using this (in the top of the PR). (ultimately these would need to become doc-examples).

jreback · 2020-01-01T02:19:16Z

doc/source/whatsnew/v1.0.0.rst

@@ -549,6 +549,7 @@ Other API changes
  Supplying anything else than ``how`` to ``**kwargs`` raised a ``TypeError`` previously (:issue:`29388`)
 - When testing pandas, the new minimum required version of pytest is 5.0.1 (:issue:`29664`)
 - :meth:`Series.str.__iter__` was deprecated and will be removed in future releases (:issue:`28277`).
+- :meth:`DataFrame.groupby` and :meth:`Series.groupby` have gained ``dropna`` argument in order to allow ``NaN`` values in group keys (:issue:`3729`)


would need a subsection, this is a major new feature

And i also updated on top in description.

jreback · 2020-01-01T02:20:08Z

pandas/core/algorithms.py

+    sort: bool = False,
+    na_sentinel: int = -1,
+    size_hint: Optional[int] = None,
+    dropna: Optional[bool] = None,


shouldn't this be just bool?

changed to bool

jreback · 2020-01-01T02:20:20Z

pandas/core/algorithms.py

@@ -630,6 +634,9 @@ def factorize(
        uniques, codes = safe_sort(
            uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
        )
+    if dropna is False and (codes == na_sentinel).any():
+        uniques = np.append(uniques, [np.nan])


ideally we push this down to cython, but ok here
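To make the two added lines easier to follow, here is a pure-Python sketch of the idea: missing values get the sentinel code, and when dropna is False a NaN is appended to the uniques so the sentinel has something to point at. `factorize_sketch` is a hypothetical helper written for illustration; it is not the pandas implementation and ignores sorting and dtype handling.

```python
import numpy as np


def factorize_sketch(values, na_sentinel=-1, dropna=True):
    """Illustrative only: encode values as integer codes plus uniques."""
    seen = {}
    uniques = []
    codes = np.empty(len(values), dtype=np.intp)
    for i, v in enumerate(values):
        # Treat None and float NaN as missing.
        if v is None or (isinstance(v, float) and np.isnan(v)):
            codes[i] = na_sentinel
            continue
        if v not in seen:
            seen[v] = len(uniques)
            uniques.append(v)
        codes[i] = seen[v]
    uniques = np.array(uniques, dtype=object)
    # Mirrors the diff: only append NaN if a missing value was actually seen.
    if not dropna and (codes == na_sentinel).any():
        uniques = np.append(uniques, [np.nan])
    return codes, uniques


codes, uniques = factorize_sketch(["a", None, "b", "a"], dropna=False)
codes_default, uniques_default = factorize_sketch(["a", None, "b", "a"])
print(codes, uniques)
```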

jreback · 2020-01-01T02:20:59Z

pandas/core/algorithms.py

@@ -630,6 +634,9 @@ def factorize(
        uniques, codes = safe_sort(
            uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
        )
+    if dropna is False and (codes == na_sentinel).any():
+        uniques = np.append(uniques, [np.nan])


hmm, this should be a dtype appropriate for the dtype

i think dtype is defined below in _construct_data? also, if added, for categorical values, will get ValueError: Categorial categories cannot be null error.

right so instead of adding a null to the categories like you are doing, you just add the appropriate -1 entries in the codes which automatically handle the nullness

emm, this has to be a None or NA value in there. otherwise, output of uniques does not have na value, and also, my python will crash with FatalError somehow
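The categorical point in this thread can be illustrated with public API only. This is a sketch of the behaviour under discussion, not of the change itself: a -1 code marks a missing entry without any null category, whereas putting NaN into the categories raises a ValueError. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A code of -1 is read as "missing"; the categories themselves stay non-null.
cat = pd.Categorical.from_codes([0, -1, 1], categories=["a", "b"])
print(cat)

# Adding a null to the categories is rejected.
try:
    pd.Categorical(["a"], categories=["a", np.nan])
    null_category_rejected = False
except ValueError:
    null_category_rejected = True
print(null_category_rejected)
```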

pandas/core/generic.py

jreback · 2020-01-02T14:01:38Z

pandas/tests/groupby/test_groupby.py

@@ -2025,3 +2025,146 @@ def test_groupby_crash_on_nunique(axis):
        expected = expected.T

    tm.assert_frame_equal(result, expected)
+
+
+@pytest.mark.parametrize(


can you make a new file, test_groupby_dropna

yes, moved to test_groupby_dropna

jreback · 2020-01-02T14:02:02Z

pandas/tests/groupby/test_groupby.py

+        ),
+    ],
+)
+def test_groupby_dropna_multi_index_dataframe_agg(dropna, tuples, outputs):


can you add an example that uses NaT (datetime & timedelta)

added! see in test_groupby_dropna
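For readers without the test file at hand, a case of the kind requested here might look like the following. It is a sketch with made-up data (the column names `when`, `elapsed` and `x` are hypothetical), assuming pandas 1.1 or later; it is not copied from test_groupby_dropna.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "when": pd.to_datetime(["2020-01-01", None, "2020-01-01", None]),
        "elapsed": pd.to_timedelta(["1 days", None, None, "1 days"]),
        "x": [1, 2, 3, 4],
    }
)

# NaT in a datetime key is kept as its own group when dropna=False.
by_when = df.groupby("when", dropna=False)["x"].sum()

# The same holds for NaT in a timedelta key.
by_elapsed = df.groupby("elapsed", dropna=False)["x"].sum()

print(by_when)
print(by_elapsed)
```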

jreback · 2020-01-02T14:02:39Z

pandas/tests/groupby/test_groupby.py

+        ["A", None, 12.3, 233.0, 12],
+        ["B", "A", 123.23, 123, 1],
+        ["A", "B", 1, 1, 1.0],
+    ]


what if we have NaN in 2 groups? does this work (e.g. different 1st level, but NaN) in 2nd. Also how is nan in first level?

added a lot more different scenarios!
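One such scenario, sketched with hypothetical data (columns `k1`, `k2`, `v`) and assuming pandas 1.1 or later: missing values in the second key under two different first-level values, plus a missing value in the first key.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "k1": ["A", "B", None, "A"],
        "k2": [None, None, "x", None],
        "v": [1, 2, 3, 4],
    }
)

# Every row has a missing value in one of the keys, so the default drops them all.
default = df.groupby(["k1", "k2"])["v"].sum()

# dropna=False keeps (A, NaN), (B, NaN) and (NaN, x) as separate groups.
with_na = df.groupby(["k1", "k2"], dropna=False)["v"].sum()

print(default)
print(with_na)
```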

jreback · 2020-01-02T14:03:18Z

doc/source/whatsnew/v1.0.0.rst

+We've added a ``dropna`` keyword to :meth:`DataFrame.groupby` and :meth:`Series.groupby` in order to
+allow ``NaN`` values in group keys. Users can define ``dropna`` to ``False`` if they want to include
+``NaN`` values in groupby keys. The default is set to ``True`` for ``dropna`` to keep backwards
+compatibility (:issue:`3729`)


add an example (like you have in the doc-strings)

also add examples in groupby.rst (and point from here)

added in both!

jreback · 2020-01-02T14:04:23Z

pandas/core/algorithms.py

@@ -630,6 +634,9 @@ def factorize(
        uniques, codes = safe_sort(
            uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
        )
+    if dropna is False and (codes == na_sentinel).any():
+        uniques = np.append(uniques, [np.nan])


right so instead of adding a null to the categories like you are doing, you just add the appropriate -1 entries in the codes which automatically handle the nullness

jreback · 2020-01-02T14:04:53Z

pandas/core/frame.py

@@ -5648,6 +5648,41 @@ def update(
 Type
 Captive      210.0
 Wild         185.0
+
+We can also choose to include NaN in group keys or not by defining


defining -> setting

jreback · 2020-01-02T14:05:45Z

pandas/core/generic.py

@@ -7346,6 +7346,12 @@ def clip(
            If False: show all values for categorical groupers.

            .. versionadded:: 0.23.0
+        dropna : bool, default True
+            If True, and if group keys contain NaN values, NaN values together


don't say NaN, say NA values (e.g. can also be NaT or the new NA scalar)

jreback · 2020-01-02T14:06:30Z

pandas/core/algorithms.py

+    values,
+    sort: bool = False,
+    na_sentinel: int = -1,
+    size_hint: Optional[int] = None,


can you add some independent tests for factorize with dropna

emm, I think i added two test cases already in test_algos. Is it what you mean? or you want to have different test cases?

charlesdong1991 · 2020-04-27T07:31:41Z

ping @jreback while waiting for other reviews

p.s. somehow, there are fewer CI checks than usual after a rebase (I expect at least 13 or so, but see only 6; it seems a lot of checks on other OSes or Python versions were not triggered). Not sure how/why, so please let me know if another rebase is needed to see whether the change is green on all of them.

jreback · 2020-04-27T12:58:35Z

hmm, try rebasing again, this happened once before then resolved itself.

charlesdong1991 · 2020-04-27T14:02:49Z

emm, weird, indeed seems it resolves itself. @jreback

charlesdong1991 · 2020-04-28T10:21:43Z

ping for reviews ^^

@jreback @jbrockmendel @TomAugspurger @jorisvandenbossche

TomAugspurger · 2020-04-28T15:50:57Z

Don't wait around on me.

charlesdong1991 · 2020-05-01T18:20:34Z

maybe @WillAyd @jbrockmendel could take a look and give some feedback?

WillAyd

I think this is a nice change. @jreback

jreback · 2020-05-06T22:20:47Z

@charlesdong1991 can you merge master and ping on green.

charlesdong1991 · 2020-05-07T07:50:10Z

ping @jreback @WillAyd

jreback · 2020-05-09T20:10:53Z

thanks @charlesdong1991 very nice! this has been a long time requested feature!

senegrom · 2020-05-25T12:19:13Z

stupid question: how does that feature propagate into the standard pandas branches? any timeframe?

jsignell · 2020-05-28T18:48:46Z

My understanding is this is slated for inclusion in pandas 1.1.0, but I'm not sure when that is planned for.

charlesdong1991 added 6 commits December 3, 2018 17:43

remove \n from docstring

7e461a1

fix conflicts

1314059

Merge remote-tracking branch 'upstream/master'

8bcb313

Merge remote-tracking branch 'upstream/master' into fix_issue_3729

13b03a8

fix issue 3729

98f6127

fix conflicts

d5fd74c

charlesdong1991 changed the title ~~ENH: Add dropna in groupby to allow NaN in keys~~ [WIP] ENH: Add dropna in groupby to allow NaN in keys Dec 31, 2019

not check type

eb717ec

charlesdong1991 changed the title ~~[WIP] ENH: Add dropna in groupby to allow NaN in keys~~ ENH: Add dropna in groupby to allow NaN in keys Dec 31, 2019

charlesdong1991 added 2 commits December 31, 2019 19:35

Add groupby test for Series

de2ee5d

Add whatsnew note

def05cc

jreback added the Groupby and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Jan 1, 2020

jreback requested changes Jan 1, 2020

charlesdong1991 added 8 commits January 1, 2020 10:30

Code change based on JR review

2888807

fix conflicts

b357659

add forgotten commits

dc4fef1

add forgotten commit

25482ec

Add dropna for series

015336d

add doc example for Series

ac2a79f

Add level example for series groupby

eb9a6f7

Add doc example for frame groupby

ffb70f8

charlesdong1991 requested a review from jreback January 1, 2020 13:35

jreback requested changes Jan 2, 2020

charlesdong1991 added 6 commits January 2, 2020 19:08

Code change based on JR reviews

b0e3cce

add doc

a1d5510

move doc

11ef56a

NaN to NA

b247a8b

Merge remote-tracking branch 'upstream/master' into fix_issue_3729

7cb027c

Fix linting

d730c4a

charlesdong1991 added 2 commits April 27, 2020 08:28

Merge remote-tracking branch 'upstream/master' into fix_issue_3729

9fec9a8

Doc fixup

ffbae76

charlesdong1991 requested a review from jreback April 27, 2020 07:33

Merge remote-tracking branch 'upstream/master' into fix_issue_3729

ef90d7c

rebase and resolve conflict

e219748

jreback mentioned this pull request Apr 29, 2020

ENH: pivot/groupby index with nan #3729

Closed

WillAyd approved these changes May 6, 2020

jreback approved these changes May 6, 2020

charlesdong1991 added 2 commits May 7, 2020 08:16

Merge remote-tracking branch 'upstream/master' into fix_issue_3729

2940908

try merge master again

4ea6aa0

jreback merged commit 88d5f12 into pandas-dev:master May 9, 2020

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020

ENH: Add dropna in groupby to allow NaN in keys (pandas-dev#30584)

61b57a7

charlesdong1991 mentioned this pull request May 12, 2020

CI: test_object_factorize_dropna failing in MacPython/pandas-wheels #34130

Closed

charlesdong1991 deleted the fix_issue_3729 branch May 14, 2020 20:55

jorisvandenbossche mentioned this pull request Jun 20, 2020

ENH: support missing values in dissolve geopandas/geopandas#1477

Closed

devin-petersohn mentioned this pull request Jul 17, 2020

[FEATURE] Pivot implementation modin-project/modin#1645

Merged

5 tasks

jorisvandenbossche mentioned this pull request Aug 11, 2020

DOC: document dropna kwarg of pd.factorize #35667

Closed

rhshadrach mentioned this pull request Sep 9, 2022

BUG: groupby doesn't distinguish between different kinds of null values #48476

Open

coroa mentioned this pull request Jun 25, 2023

isna does not work with explicit MultiIndex nan-representation coroa/pandas-indexing#25

Open
