ENH: Add dropna in groupby to allow NaN in keys #30584

Merged
merged 59 commits into from
May 9, 2020
Changes from 57 commits
Commits
7e461a1
remove \n from docstring
charlesdong1991 Dec 3, 2018
1314059
fix conflicts
charlesdong1991 Jan 19, 2019
8bcb313
Merge remote-tracking branch 'upstream/master'
charlesdong1991 Jul 30, 2019
13b03a8
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Dec 31, 2019
98f6127
fix issue 3729
charlesdong1991 Dec 31, 2019
d5fd74c
fix conflicts
charlesdong1991 Dec 31, 2019
eb717ec
not check type
charlesdong1991 Dec 31, 2019
de2ee5d
Add groupby test for Series
charlesdong1991 Dec 31, 2019
def05cc
Add whatsnew note
charlesdong1991 Dec 31, 2019
2888807
Code change based on JR review
charlesdong1991 Jan 1, 2020
b357659
fix conflicts
charlesdong1991 Jan 1, 2020
dc4fef1
add forgotten commits
charlesdong1991 Jan 1, 2020
25482ec
add forgotten commit
charlesdong1991 Jan 1, 2020
015336d
Add dropna for series
charlesdong1991 Jan 1, 2020
ac2a79f
add doc example for Series
charlesdong1991 Jan 1, 2020
eb9a6f7
Add level example for series groupby
charlesdong1991 Jan 1, 2020
ffb70f8
Add doc example for frame groupby
charlesdong1991 Jan 1, 2020
b0e3cce
Code change based on JR reviews
charlesdong1991 Jan 2, 2020
a1d5510
add doc
charlesdong1991 Jan 2, 2020
11ef56a
move doc
charlesdong1991 Jan 2, 2020
b247a8b
NaN to NA
charlesdong1991 Jan 2, 2020
7cb027c
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Jan 2, 2020
d730c4a
Fix linting
charlesdong1991 Jan 2, 2020
42c4934
fix rst issue
charlesdong1991 Jan 2, 2020
2ba79b9
fix rst issue
charlesdong1991 Jan 2, 2020
8b79b6c
refactor based on WA review
charlesdong1991 Jan 3, 2020
a4fdf2d
merge master and resolve conflicts
charlesdong1991 Feb 10, 2020
4ac15e3
remove blank
charlesdong1991 Feb 10, 2020
4ebbad3
code change on reviews
charlesdong1991 Feb 10, 2020
f141b80
use pd.testing
charlesdong1991 Feb 10, 2020
23ad19b
linting
charlesdong1991 Feb 10, 2020
bafc4a5
fixup
charlesdong1991 Feb 10, 2020
c98bafe
fixup
charlesdong1991 Feb 10, 2020
86a5958
doc
charlesdong1991 Feb 10, 2020
6cf31d7
validation
charlesdong1991 Feb 10, 2020
2b77f37
xfail windows
charlesdong1991 Feb 10, 2020
451ec97
rebase and resolve conflict
charlesdong1991 Feb 19, 2020
1089b18
fixup based on WA review
charlesdong1991 Feb 22, 2020
63da563
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Feb 22, 2020
1b3f22a
fix conflicts
charlesdong1991 Apr 7, 2020
3f360a9
reduce tests
charlesdong1991 Apr 7, 2020
5cabe4b
fix pep8
charlesdong1991 Apr 7, 2020
76ffb9f
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 11, 2020
6c126c7
rebase and docs fixes
charlesdong1991 Apr 11, 2020
6d61d6a
fixup doc
charlesdong1991 Apr 11, 2020
3630e8b
remove inferred type
charlesdong1991 Apr 11, 2020
1cec7f1
better comment
charlesdong1991 Apr 11, 2020
1a1bb49
remove xfail
charlesdong1991 Apr 11, 2020
7ea2e79
use fixture
charlesdong1991 Apr 11, 2020
13b1e9a
coelse type for windows build
charlesdong1991 Apr 11, 2020
92a7eed
fixup
charlesdong1991 Apr 11, 2020
1315a9d
fixup
charlesdong1991 Apr 11, 2020
a7959d5
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 15, 2020
9fec9a8
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 27, 2020
ffbae76
Doc fixup
charlesdong1991 Apr 27, 2020
ef90d7c
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 27, 2020
e219748
rebase and resolve conflict
charlesdong1991 Apr 27, 2020
2940908
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 May 7, 2020
4ea6aa0
try merge master again
charlesdong1991 May 7, 2020
27 changes: 27 additions & 0 deletions doc/source/user_guide/groupby.rst
@@ -199,6 +199,33 @@ For example, the groups created by ``groupby()`` below are in the order they app
df3.groupby(['X']).get_group('B')


.. _groupby.dropna:

.. versionadded:: 1.1.0

GroupBy dropna
^^^^^^^^^^^^^^

By default, ``NA`` values are excluded from group keys during the ``groupby`` operation.
To include ``NA`` values in group keys, pass ``dropna=False``.

.. ipython:: python

df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

Contributor: can you show the df_list, then put the actual groupbys in another ipython block

Member Author (charlesdong1991, Apr 27, 2020): i think you meant df_dropna? i showed it and put groupbys to another block

df_dropna

.. ipython:: python

# Default `dropna` is set to True, which will exclude NaNs in keys
df_dropna.groupby(by=["b"], dropna=True).sum()

# In order to allow NaN in keys, set `dropna` to False
df_dropna.groupby(by=["b"], dropna=False).sum()

The default for the ``dropna`` argument is ``True``, which means ``NA`` values are not included in group keys.
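
A minimal, self-contained sketch of the behaviour this documentation section describes, assuming pandas 1.1.0 or later (the release in which `dropna` became available). The variable names `dropped` and `kept` are illustrative and do not appear in the PR:

```python
import pandas as pd

# Same data as the documentation example; the second row has a missing key in "b".
df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

# Default behaviour: the row whose key is NA is left out of every group.
dropped = df_dropna.groupby(by=["b"], dropna=True).sum()

# With dropna=False the NA key forms its own group.
kept = df_dropna.groupby(by=["b"], dropna=False).sum()

print(dropped)
print(kept)
```

Here `dropped` should contain two groups (keys 1.0 and 2.0), while `kept` should contain a third group keyed by NaN that holds the row `[1, None, 4]`.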


.. _groupby.attributes:

32 changes: 32 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -36,6 +36,37 @@ For example:
ser["2014"]
ser.loc["May 2015"]


.. _whatsnew_110.groupby_key:

Allow NA in groupby key
^^^^^^^^^^^^^^^^^^^^^^^^

With :ref:`groupby <groupby.dropna>`, we've added a ``dropna`` keyword to :meth:`DataFrame.groupby` and :meth:`Series.groupby` in order to
allow ``NA`` values in group keys. Users can set ``dropna`` to ``False`` if they want to include
``NA`` values in groupby keys. The default is set to ``True`` for ``dropna`` to keep backwards
compatibility (:issue:`3729`)

Contributor: add a ref to the new doc-section.

Member Author: emm, what does this mean?

Contributor: you add something like: :ref:`cookbook<cookbook.grouping>` to put in a link to the new groupby section you added about (obviously use the new ref you created)

Member Author: thanks for the hint! I added one following other examples, pls let me know if it is okay now


.. ipython:: python

df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

Contributor: use the same example as in the docs section (e.g. make the changes here as well)

Member Author: yeah, the example is the same as in groupby.rst if it is what you meant. I also make changes as suggested above.

Contributor: same change here as above

df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

df_dropna

.. ipython:: python

# Default `dropna` is set to True, which will exclude NaNs in keys
df_dropna.groupby(by=["b"], dropna=True).sum()

# In order to allow NaN in keys, set `dropna` to False
df_dropna.groupby(by=["b"], dropna=False).sum()

The default for the ``dropna`` argument is ``True``, which means ``NA`` values are not included in group keys.

.. versionadded:: 1.1.0


.. _whatsnew_110.key_sorting:

Sorting with keys
@@ -83,6 +114,7 @@ When applied to a `DataFrame`, they key is applied per-column to all columns or
For more details, see examples and documentation in :meth:`DataFrame.sort_values`,
:meth:`Series.sort_values`, and :meth:`~DataFrame.sort_index`.


.. _whatsnew_110.timestamp_fold_support:

Fold argument support in Timestamp constructor
14 changes: 13 additions & 1 deletion pandas/core/algorithms.py
@@ -509,7 +509,11 @@ def _factorize_array(
     ),
 )
 def factorize(
-    values, sort: bool = False, na_sentinel: int = -1, size_hint: Optional[int] = None
+    values,
+    sort: bool = False,
+    na_sentinel: int = -1,
+    size_hint: Optional[int] = None,
+    dropna: bool = True,
 ) -> Tuple[np.ndarray, Union[np.ndarray, ABCIndex]]:
     """
     Encode the object as an enumerated type or categorical variable.
@@ -635,6 +639,14 @@ def factorize(
             uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
         )
 
+    code_is_na = codes == na_sentinel
+    if not dropna and code_is_na.any():
+        # na_value is set based on the dtype of uniques, and compat set to False is
+        # because we do not want na_value to be 0 for integers
+        na_value = na_value_for_dtype(uniques.dtype, compat=False)
+        uniques = np.append(uniques, [na_value])
+        codes = np.where(code_is_na, len(uniques) - 1, codes)
+
     uniques = _reconstruct_data(uniques, dtype, original)
 
     # return original tenor
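
The added branch in `factorize` can be illustrated in isolation with plain NumPy. The sketch below is not the pandas implementation: `append_na_group` is a hypothetical helper, and it assumes float `uniques` so that `np.nan` can stand in for the value the PR obtains from `na_value_for_dtype(uniques.dtype, compat=False)`.

```python
import numpy as np


def append_na_group(codes, uniques, na_sentinel=-1):
    """Hypothetical helper: give NA rows their own code instead of the sentinel."""
    codes = np.asarray(codes)
    uniques = np.asarray(uniques, dtype=float)  # assumption: float keys, NA is np.nan

    # Rows that factorization marked as missing carry the sentinel code.
    code_is_na = codes == na_sentinel
    if code_is_na.any():
        # Append one NA entry to the uniques and point every NA row at it.
        uniques = np.append(uniques, [np.nan])
        codes = np.where(code_is_na, len(uniques) - 1, codes)
    return codes, uniques


codes, uniques = append_na_group([0, -1, 1, 0], [2.0, 1.0])
print(codes)    # the -1 entry is replaced by the index of the appended NA
print(uniques)  # the NA value is the last unique
```

Because the NA entry is appended after the existing uniques, the NA group ends up last, which matches the ordering shown in the docstring examples of this PR.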
37 changes: 37 additions & 0 deletions pandas/core/frame.py
@@ -6140,6 +6140,41 @@ def update(
Type
Captive 210.0
Wild 185.0

We can also choose whether to include NA in group keys by setting
the `dropna` parameter; the default setting is `True`:

>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])

>>> df.groupby(by=["b"]).sum()
     a  c
b
1.0  2  3
2.0  2  5

>>> df.groupby(by=["b"], dropna=False).sum()
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4

>>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])

>>> df.groupby(by="a").sum()
        b      c
a
a    13.0   13.0
b    12.3  123.0

>>> df.groupby(by="a", dropna=False).sum()
        b      c
a
a    13.0   13.0
b    12.3  123.0
NaN  12.3   33.0
"""
)
@Appender(_shared_docs["groupby"] % _shared_doc_kwargs)
@@ -6153,6 +6188,7 @@ def groupby(
group_keys: bool = True,
squeeze: bool = False,
observed: bool = False,
dropna: bool = True,
) -> "DataFrameGroupBy":
from pandas.core.groupby.generic import DataFrameGroupBy

@@ -6170,6 +6206,7 @@ def groupby(
group_keys=group_keys,
squeeze=squeeze,
observed=observed,
dropna=dropna,
)

_shared_docs[
6 changes: 6 additions & 0 deletions pandas/core/generic.py
@@ -7475,6 +7475,12 @@ def clip(
If False: show all values for categorical groupers.

.. versionadded:: 0.23.0
dropna : bool, default True
If True, and if group keys contain NA values, the NA values together
with the corresponding row/column will be dropped.
If False, NA values will also be treated as keys in groups.

.. versionadded:: 1.1.0

Returns
-------
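
One way to read the `dropna` parameter description above: with `dropna=True`, rows whose key is NA belong to no group, so the group sizes add up to fewer rows than the frame contains. A small sketch, assuming pandas 1.1.0 or later; the frame and its column names are made up for illustration:

```python
import pandas as pd

# Hypothetical data: one of the four rows has a missing group key.
df = pd.DataFrame({"key": ["x", None, "y", "x"], "val": [1, 2, 3, 4]})

# Default (dropna=True): the row with the NA key is dropped from the result.
sizes_default = df.groupby("key").size()

# dropna=False: the NA key is treated as a group of its own.
sizes_with_na = df.groupby("key", dropna=False).size()

print(int(sizes_default.sum()), "of", len(df), "rows are grouped by default")
print(int(sizes_with_na.sum()), "of", len(df), "rows are grouped with dropna=False")
```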
5 changes: 5 additions & 0 deletions pandas/core/groupby/groupby.py
@@ -405,6 +405,7 @@ def __init__(
squeeze: bool = False,
observed: bool = False,
mutated: bool = False,
dropna: bool = True,
):

self._selection = selection
@@ -427,6 +428,7 @@ def __init__(
self.squeeze = squeeze
self.observed = observed
self.mutated = mutated
self.dropna = dropna

if grouper is None:
from pandas.core.groupby.grouper import get_grouper
@@ -439,6 +441,7 @@ def __init__(
sort=sort,
observed=observed,
mutated=self.mutated,
dropna=self.dropna,
)

self.obj = obj
@@ -2580,6 +2583,7 @@ def get_groupby(
squeeze: bool = False,
observed: bool = False,
mutated: bool = False,
dropna: bool = True,
) -> GroupBy:

klass: Type[GroupBy]
@@ -2608,4 +2612,5 @@ def get_groupby(
squeeze=squeeze,
observed=observed,
mutated=mutated,
dropna=dropna,
)
14 changes: 12 additions & 2 deletions pandas/core/groupby/grouper.py
@@ -134,7 +134,9 @@ def __new__(cls, *args, **kwargs):
             cls = TimeGrouper
         return super().__new__(cls)
 
-    def __init__(self, key=None, level=None, freq=None, axis=0, sort=False):
+    def __init__(
+        self, key=None, level=None, freq=None, axis=0, sort=False, dropna=True
+    ):
         self.key = key
         self.level = level
         self.freq = freq
@@ -146,6 +148,7 @@ def __init__(self, key=None, level=None, freq=None, axis=0, sort=False):
self.indexer = None
self.binner = None
self._grouper = None
self.dropna = dropna

@property
def ax(self):
@@ -171,6 +174,7 @@ def _get_grouper(self, obj, validate: bool = True):
level=self.level,
sort=self.sort,
validate=validate,
dropna=self.dropna,
)
return self.binner, self.grouper, self.obj

@@ -283,6 +287,7 @@ def __init__(
sort: bool = True,
observed: bool = False,
in_axis: bool = False,
dropna: bool = True,
):
self.name = name
self.level = level
@@ -293,6 +298,7 @@ def __init__(
self.obj = obj
self.observed = observed
self.in_axis = in_axis
self.dropna = dropna

# right place for this?
if isinstance(grouper, (Series, Index)) and name is None:
@@ -446,7 +452,9 @@ def _make_codes(self) -> None:
                 codes = self.grouper.codes_info
                 uniques = self.grouper.result_index
             else:
-                codes, uniques = algorithms.factorize(self.grouper, sort=self.sort)
+                codes, uniques = algorithms.factorize(
+                    self.grouper, sort=self.sort, dropna=self.dropna
+                )
                 uniques = Index(uniques, name=self.name)
             self._codes = codes
             self._group_index = uniques
@@ -465,6 +473,7 @@ def get_grouper(
observed: bool = False,
mutated: bool = False,
validate: bool = True,
dropna: bool = True,
) -> "Tuple[ops.BaseGrouper, List[Hashable], FrameOrSeries]":
"""
Create and return a BaseGrouper, which is an internal
@@ -655,6 +664,7 @@ def is_in_obj(gpr) -> bool:
sort=sort,
observed=observed,
in_axis=in_axis,
dropna=dropna,
)
if not isinstance(gpr, Grouping)
else gpr
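
The hunks in `groupby.py` and `grouper.py` mostly thread the new keyword from the public `groupby` call down to the factorization step. The toy sketch below shows that pattern in pure Python under simplifying assumptions: every name in it (`toy_factorize`, `ToyGrouping`, `toy_groupby_sum`) is hypothetical, `None` stands in for NA, and, unlike the PR, the NA key is not moved to the end of the uniques.

```python
def toy_factorize(values, dropna=True):
    """Encode values as integer codes; None plays the role of NA."""
    uniques = []
    codes = []
    for v in values:
        if v is None and dropna:
            codes.append(-1)  # sentinel: this row belongs to no group
            continue
        if v not in uniques:
            uniques.append(v)
        codes.append(uniques.index(v))
    return codes, uniques


class ToyGrouping:
    """Stores the flag and forwards it, loosely like Grouping._make_codes."""

    def __init__(self, keys, dropna=True):
        self.keys = keys
        self.dropna = dropna

    def make_codes(self):
        return toy_factorize(self.keys, dropna=self.dropna)


def toy_groupby_sum(keys, values, dropna=True):
    """Entry point: passes dropna down and sums the values per key."""
    codes, uniques = ToyGrouping(keys, dropna=dropna).make_codes()
    totals = {u: 0 for u in uniques}
    for code, value in zip(codes, values):
        if code != -1:  # rows carrying the sentinel are skipped
            totals[uniques[code]] += value
    return totals


print(toy_groupby_sum([2, None, 1, 2], [3, 4, 3, 2]))
print(toy_groupby_sum([2, None, 1, 2], [3, 4, 3, 2], dropna=False))
```

The design point is that only the lowest layer changes behaviour; the intermediate layers merely store the flag and pass it on, which is why most hunks in this PR are one-line additions.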
30 changes: 30 additions & 0 deletions pandas/core/series.py
@@ -1609,6 +1609,34 @@ def _set_name(self, name, inplace=False) -> "Series":
Captive 210.0
Wild 185.0
Name: Max Speed, dtype: float64

We can also choose whether to include `NA` in group keys by setting
the `dropna` parameter; the default setting is `True`:

>>> ser = pd.Series([1, 2, 3, 3], index=["a", 'a', 'b', np.nan])
>>> ser.groupby(level=0).sum()
a    3
b    3
dtype: int64

>>> ser.groupby(level=0, dropna=False).sum()
a      3
b      3
NaN    3
dtype: int64

>>> arrays = ['Falcon', 'Falcon', 'Parrot', 'Parrot']
>>> ser = pd.Series([390., 350., 30., 20.], index=arrays, name="Max Speed")
>>> ser.groupby(["a", "b", "a", np.nan]).mean()
a    210.0
b    350.0
Name: Max Speed, dtype: float64

>>> ser.groupby(["a", "b", "a", np.nan], dropna=False).mean()
a      210.0
b      350.0
NaN     20.0
Name: Max Speed, dtype: float64
"""
)
@Appender(generic._shared_docs["groupby"] % _shared_doc_kwargs)
@@ -1622,6 +1650,7 @@ def groupby(
group_keys: bool = True,
squeeze: bool = False,
observed: bool = False,
dropna: bool = True,
) -> "SeriesGroupBy":
from pandas.core.groupby.generic import SeriesGroupBy

@@ -1639,6 +1668,7 @@ def groupby(
group_keys=group_keys,
squeeze=squeeze,
observed=observed,
dropna=dropna,
)

# ----------------------------------------------------------------------