Categorical.(get|from)_dummies #34426

Closed
47 commits
b5ab7f2
Categorical (to|from)_dummies methods
clbarnes May 27, 2020
f937c96
Tests: Categorical.(to|from)_dummies
clbarnes May 28, 2020
dd14132
Add reference to Categorical.to_dummies to get_dummies
clbarnes May 28, 2020
9dc9da5
whatsnew: add issue number to Categorical.(to|from)_dummies
clbarnes May 28, 2020
ac9cec2
Review comments for dummies tests
clbarnes May 28, 2020
0459cb1
Review comments for dummies implementation
clbarnes May 28, 2020
65e68c2
dummies review comments
clbarnes May 29, 2020
1334026
User guide: Describe Categorical.(to|from)_dummies
clbarnes May 29, 2020
c2240b6
Fix user guide errors
clbarnes May 29, 2020
66771bf
Fix numpy element from sequence error
clbarnes May 29, 2020
4e769da
Test to_dummies column type cast
clbarnes May 29, 2020
fe002af
Test review comments
clbarnes May 29, 2020
097f2c6
Review comments for implementation
clbarnes May 29, 2020
afe8eda
Fix doctest for missing values
clbarnes May 29, 2020
e78158e
xfail for Categorical from sparse
clbarnes May 29, 2020
4fb1e5e
Fix tests
clbarnes May 29, 2020
9fa5494
fix isna
clbarnes May 29, 2020
61567fd
Explicit integer type cast
clbarnes May 29, 2020
5d724cc
Test categorical <-> dummies roundtrip
clbarnes May 29, 2020
1182ce5
more type casts
clbarnes May 29, 2020
04ca72a
Add wiki link for dummy variables
clbarnes Sep 17, 2020
6e4f71a
Remove deprecated numpy <v1.16 check
clbarnes Sep 17, 2020
a761baf
isort fix
clbarnes Sep 17, 2020
ed58c77
undo changes to whatsnew v1.1.0
clbarnes Sep 17, 2020
741cf8f
whatsnew/v1.2.0: Categorical get/from dummies
clbarnes Sep 17, 2020
6f199b6
Update user_guide/categorical docs
clbarnes Sep 17, 2020
034f8e1
Reference Categorical.get_dummies in reshape.py
clbarnes Sep 17, 2020
b80f089
Categorical->dummies more like get_dummies
clbarnes Sep 17, 2020
0eb936f
categorical tests
clbarnes Sep 17, 2020
8f212e1
isort pandas_web
clbarnes Sep 17, 2020
bda5265
fix _get_dummies_1d import path
clbarnes Sep 17, 2020
6e6ddda
categorical.test_api: to->get dummies
clbarnes Sep 17, 2020
9fcebf0
isort pandas_web
clbarnes Sep 17, 2020
b9908c4
fix typos in categorical doctests
clbarnes Sep 17, 2020
faeec41
isort test_datetime
clbarnes Sep 17, 2020
e11f28e
use get_dummies instead of _get_dummies_1d
clbarnes Sep 17, 2020
742c940
Reference get_dummies/ from_dummies in reshaping docs
clbarnes Sep 17, 2020
722137d
use prefix in from_dummies
clbarnes Sep 17, 2020
4945ba8
document prefix handling in categorical.rst
clbarnes Sep 17, 2020
1f98233
Lower-memory impl for Categorical.from_dummies
clbarnes Sep 17, 2020
ff01048
remove comment about use of _get_dummies_1d
clbarnes Sep 22, 2020
604b839
type-annotate get/from_dummies
clbarnes Sep 22, 2020
c71e807
split overlong line
clbarnes Sep 22, 2020
6f9272a
blacken
clbarnes Sep 22, 2020
8fd4b72
use f-strings
clbarnes Sep 22, 2020
534bc33
add some typing
clbarnes Sep 22, 2020
0facec6
remove unnecessary .values
clbarnes Sep 22, 2020
39 changes: 39 additions & 0 deletions doc/source/user_guide/categorical.rst
@@ -127,6 +127,45 @@ This conversion is likewise done column by column:
    df_cat['A']
    df_cat['B']

Dummy / indicator / one-hot encoded variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some operations, like regression and classification,
require a single categorical variable to be encoded as one column per category,
with each row holding True in exactly one column and False in all the others.
Such columns are called `dummy variables <https://en.wikipedia.org/wiki/Dummy_variable_(statistics)>`_, and the scheme is also known as one-hot encoding.
:class:`pandas.Categorical` objects can be converted to and from such an encoding.

:meth:`pandas.Categorical.get_dummies` produces a dataframe of dummy variables.
It works in the same way and supports most of the same arguments as :func:`pandas.get_dummies`.

.. ipython:: python

    cat = pd.Categorical(["a", "b", "b", "c"])
    cat

    cat.get_dummies()
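
As a rough illustration of the encoding described above, the sketch below uses the existing top-level ``pandas.get_dummies`` function, which already accepts a ``Categorical``. ``Categorical.get_dummies`` is only proposed in this pull request, so it is deliberately not called here. The variable names are illustrative.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "b", "c"])

# One column per category; each row marks exactly one category.
dummies = pd.get_dummies(cat)

# Cast to int so the result reads the same whether pandas returns
# uint8 columns (older releases) or bool columns (newer releases).
one_hot = dummies.astype(int).to_numpy().tolist()

print(list(dummies.columns))
print(one_hot)
```

Each inner list has a single 1 whose position matches the category of the corresponding value.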

The :meth:`pandas.Categorical.from_dummies` class method accepts a dataframe
whose dtypes are coercible to boolean, and an ``ordered`` argument
for whether the resulting ``Categorical`` should be considered ordered
Contributor

you need to have some good cross links to the current get_dummies section

otherwise this is very confusing

i would prefer that these are actually in the get_dummies with just a small note here

(like the ``Categorical`` constructor).
A column whose label is NA is ignored.
Any row which is entirely falsey will be uncategorised.
Missing values raise a ``ValueError`` unless the ``fillna`` argument says how to fill them.
In the same way that :func:`pandas.get_dummies` can add a prefix to string category names,
:meth:`~pandas.Categorical.from_dummies` can select the columns of a dataframe that carry a given prefix;
the resulting ``Categorical`` has the prefix stripped from its categories.

.. ipython:: python

    dummies = pd.get_dummies(["a", "b", "b", "c"], prefix="cat")
    dummies

    pd.Categorical.from_dummies(dummies, prefix="cat")
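
The prefix handling described above can be approximated with public pandas calls only. The helper below, ``categorical_from_prefixed_dummies``, is a hypothetical name and not part of pandas or of this pull request. It is a minimal sketch that assumes each row has at most one truthy value and does not validate that assumption.

```python
import pandas as pd


def categorical_from_prefixed_dummies(dummies, prefix, prefix_sep="_"):
    """Hypothetical helper: rebuild a Categorical from prefixed dummy columns."""
    pref = prefix + prefix_sep
    # Keep only string-labelled columns that carry the prefix.
    keep = [c for c in dummies.columns if isinstance(c, str) and c.startswith(pref)]
    categories = [c[len(pref):] for c in keep]
    values = dummies[keep].to_numpy().astype(bool)
    # argmax gives the position of the first True in each row;
    # rows with no True get the missing-value code -1.
    codes = values.argmax(axis=1)
    codes[~values.any(axis=1)] = -1
    return pd.Categorical.from_codes(codes, categories=categories)


dummies = pd.get_dummies(pd.Series(["a", "b", "b", "c"]), prefix="cat")
restored = categorical_from_prefixed_dummies(dummies, prefix="cat")
print(restored)
```

Columns without the prefix are ignored, so the same frame can hold dummy columns for several variables.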


.. versionadded:: 1.2.0

Controlling behavior
~~~~~~~~~~~~~~~~~~~~
11 changes: 10 additions & 1 deletion doc/source/user_guide/reshaping.rst
@@ -606,7 +606,7 @@ This function is often used along with discretization functions like ``cut``:

    pd.get_dummies(pd.cut(values, bins))

See also :func:`Series.str.get_dummies <pandas.Series.str.get_dummies>`.
See also :func:`Series.str.get_dummies <pandas.Series.str.get_dummies>` and :func:`Categorical.get_dummies <pandas.Categorical.get_dummies>`.

:func:`get_dummies` also accepts a ``DataFrame``. By default all categorical
variables (categorical in the statistical sense, those with `object` or
@@ -679,6 +679,15 @@ To choose another dtype, use the ``dtype`` argument:

    pd.get_dummies(df, dtype=bool).dtypes

A :class:`~pandas.Categorical` can be recovered from a :class:`~pandas.DataFrame` of such dummy variables using :meth:`~pandas.Categorical.from_dummies`.
Use the ``prefix`` and ``prefix_sep`` arguments to select columns that carry a prefix of the kind :func:`~pandas.get_dummies` applies; the prefix is stripped from the resulting category names.

.. ipython:: python

    df = pd.get_dummies(list("abca"))

    pd.Categorical.from_dummies(df)


.. _reshaping.factorize:

1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -120,6 +120,7 @@ Other enhancements
- `Styler` now allows direct CSS class name addition to individual data cells (:issue:`36159`)
- :meth:`Rolling.mean()` and :meth:`Rolling.sum()` use Kahan summation to calculate the mean to avoid numerical problems (:issue:`10319`, :issue:`11645`, :issue:`13254`, :issue:`32761`, :issue:`36031`)
- :meth:`DatetimeIndex.searchsorted`, :meth:`TimedeltaIndex.searchsorted`, :meth:`PeriodIndex.searchsorted`, and :meth:`Series.searchsorted` with datetimelike dtypes will now try to cast string arguments (listlike and scalar) to the matching datetimelike type (:issue:`36346`)
- :meth:`Categorical.from_dummies` and :meth:`Categorical.get_dummies` convert between :class:`Categorical` and :class:`DataFrame` objects of dummy variables.

.. _whatsnew_120.api_breaking.python:

220 changes: 219 additions & 1 deletion pandas/core/arrays/categorical.py
@@ -2,7 +2,7 @@
from functools import partial
import operator
from shutil import get_terminal_size
from typing import Dict, Hashable, List, Type, Union, cast
from typing import TYPE_CHECKING, Any, Dict, Hashable, List, Optional, Type, Union, cast
from warnings import warn

import numpy as np
@@ -55,6 +55,9 @@

from pandas.io.formats import console

if TYPE_CHECKING:
    from pandas._typing import DataFrame  # noqa: F401


def _cat_compare_op(op):
    opname = f"__{op.__name__}__"
@@ -370,6 +373,221 @@ def __init__(
self._dtype = self._dtype.update_dtype(dtype)
self._codes = coerce_indexer_dtype(codes, dtype.categories)

    @classmethod
Contributor

i agree with the others

either remove this or make it identical to get_dummies

Contributor Author

I think the discussion around this was more about the necessity/ implementation of to_dummies - there currently isn't an equivalent of from_dummies in pandas, which is what originally precipitated the issue. But note taken re. to_dummies if that was the intention.

    def from_dummies(
        cls,
        dummies: "DataFrame",
        ordered: Optional[bool] = None,
        prefix: Optional[str] = None,
        prefix_sep: str = "_",
        fillna: Optional[bool] = None,
    ) -> "Categorical":
        """Create a `Categorical` using a ``DataFrame`` of dummy variables.

        Can use a subset of columns based on the ``prefix``
        and ``prefix_sep`` parameters.

        The ``DataFrame`` must have no more than one truthy value per row.
        The columns of the ``DataFrame`` become the categories of the `Categorical`.
        A column whose header is NA will be dropped.
        A row with no truthy value will be uncategorised.
        NA values within the data are handled according to ``fillna``.

        Parameters
        ----------
        dummies : DataFrame
            dtypes of columns with non-NA headers must be coercible to bool.
            Sparse dataframes are not supported.
        ordered : bool, optional
            Whether or not this Categorical is ordered.
        prefix : str, optional
            Only take columns whose names are strings starting
            with this prefix and ``prefix_sep``,
            stripping those elements from the resulting category names.
        prefix_sep : str, default "_"
            If ``prefix`` is not ``None``, use as the separator
            between the prefix and the final name of the category.
        fillna : bool, optional
            How to handle NA values.
            If ``True`` or ``False``, NA is filled with that value.
            If ``None``, raise a ValueError if there are any NA values.

        Raises
        ------
        ValueError
            If a sample belongs to >1 category,
            or if there are NA values and ``fillna`` is ``None``.

        Returns
        -------
        Categorical

        Examples
        --------
        >>> simple = pd.DataFrame(np.eye(3), columns=["a", "b", "c"])
        >>> Categorical.from_dummies(simple)
        [a, b, c]
        Categories (3, object): [a, b, c]

        >>> nan_col = pd.DataFrame(np.eye(4), columns=["a", "b", np.nan, None])
        >>> Categorical.from_dummies(nan_col)
        [a, b, NaN, NaN]
        Categories (2, object): [a, b]

        >>> nan_cell = pd.DataFrame(
        ...     [[1, 0, np.nan], [0, 1, 0], [0, 0, 1]],
        ...     columns=["a", "b", "c"],
        ... )
        >>> Categorical.from_dummies(nan_cell)
        Traceback (most recent call last):
            ...
        ValueError: Unhandled NA values in dummy array

        >>> multi = pd.DataFrame(
        ...     [[1, 0, 1], [0, 1, 0], [0, 0, 1]],
        ...     columns=["a", "b", "c"],
        ... )
        >>> Categorical.from_dummies(multi)
        Traceback (most recent call last):
            ...
        ValueError: 1 record(s) belongs to multiple categories: [0]
        """
        from pandas import Series

        to_drop = dummies.columns[isna(dummies.columns)]
        if len(to_drop):
            dummies = dummies.drop(columns=to_drop)

        cats: List[Any]
        if prefix is None:
            cats = list(dummies.columns)
        else:
            pref = prefix + (prefix_sep or "")
            cats = []
            to_keep: List[str] = []
            for c in dummies.columns:
                if isinstance(c, str) and c.startswith(pref):
                    to_keep.append(c)
                    cats.append(c[len(pref) :])
            dummies = dummies[to_keep]

        df = dummies.astype("boolean")
        if fillna is not None:
            df = df.fillna(fillna)

        row_totals = df.sum(axis=1, skipna=False)
Contributor

why skipna?

Contributor Author

If there is no explicit fillna policy given, and there are still NA values in the data, I'd prefer to raise an error rather than silently pass bunk data through. Therefore nans should not be skipped in this step so that they can be checked for in the next line.
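
The effect of ``skipna=False`` discussed in this comment can be seen with a small float frame. This is a sketch using plain ``float64`` columns rather than the nullable ``"boolean"`` dtype used in the diff, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

dummies = pd.DataFrame({"a": [1.0, 0.0], "b": [np.nan, 1.0]})

# Default skipna=True ignores the NaN, so row 0 looks like a valid one-hot row.
skipped = dummies.sum(axis=1)

# skipna=False propagates the NaN into the row total, where isna() can detect it.
strict = dummies.sum(axis=1, skipna=False)

print(skipped.tolist())
print(strict.isna().tolist())
```

With the default, both row totals are 1.0 and the missing value goes unnoticed; with ``skipna=False`` the first row total is NaN and can trigger an error.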

        if row_totals.isna().any():
            raise ValueError("Unhandled NA values in dummy array")
Contributor

is this tested?

Contributor Author

There are some holes left in the tests, on the to do list


        multicat_rows = row_totals > 1
        if multicat_rows.any():
            raise ValueError(
                f"{multicat_rows.sum()} record(s) belongs to multiple categories: "
                f"{list(df.index[multicat_rows])}"
            )

        codes = Series(np.full(len(row_totals), np.nan), index=df.index, dtype="Int64")
        codes[row_totals == 0] = -1
        row_idx, code = np.nonzero(df)
        codes[row_idx] = code

        return cls.from_codes(codes.fillna(-1), cats, ordered=ordered)
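
The decoding logic above (reject rows with more than one truthy value, leave all-falsey rows uncategorised, otherwise pick the matching column) can be sketched without pandas. ``decode_one_hot`` is a hypothetical stand-alone function written for illustration, not the implementation in this diff. It uses ``None`` where the real method would use the code ``-1``, and it does not handle NA values.

```python
from typing import Hashable, List, Optional, Sequence


def decode_one_hot(
    rows: Sequence[Sequence[bool]], categories: Sequence[Hashable]
) -> List[Optional[Hashable]]:
    """Hypothetical helper: map each one-hot row to its category (or None)."""
    # Collect the offending row positions first so the error can list them all.
    multi = [i for i, row in enumerate(rows) if sum(bool(v) for v in row) > 1]
    if multi:
        raise ValueError(
            f"{len(multi)} record(s) belongs to multiple categories: {multi}"
        )
    decoded: List[Optional[Hashable]] = []
    for row in rows:
        hits = [cat for cat, flag in zip(categories, row) if flag]
        # An all-falsey row has no category, like code -1 in a Categorical.
        decoded.append(hits[0] if hits else None)
    return decoded


print(decode_one_hot([[1, 0, 0], [0, 1, 0], [0, 0, 0]], ["a", "b", "c"]))
```

Validating all rows before decoding any of them means the error message can name every ambiguous row at once, which mirrors the message format used in the diff.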

    def get_dummies(
        self,
        prefix: Optional[str] = None,
        prefix_sep: str = "_",
        dummy_na: bool = False,
        sparse: bool = False,
        drop_first: bool = False,
        dtype: Dtype = None,
    ) -> "DataFrame":
        """
        Convert into dummy/indicator variables.

        Parameters
        ----------
        prefix : str, default None
            String to prepend to the DataFrame column names.
        prefix_sep : str, default '_'
            If a prefix is given, separator/delimiter to use.
        dummy_na : bool, default False
            Add a column to indicate NaNs, if False NaNs are ignored.
        sparse : bool, default False
            Whether the dummy-encoded columns should be backed by
            a :class:`SparseArray` (True) or a regular NumPy array (False).
        drop_first : bool, default False
            Whether to get k-1 dummies out of k categorical levels by removing the
            first level.
        dtype : dtype, default np.uint8
            Data type for new columns. Only a single dtype is allowed.

        Returns
        -------
        DataFrame
            Dummy-coded data.

        See Also
        --------
        Series.str.get_dummies : Convert Series to dummy codes.
        pandas.get_dummies : Convert categorical variable to dummy/indicator variables.

        Examples
        --------
        >>> s = pd.Categorical(list('abca'))

        >>> s.get_dummies()
           a  b  c
        0  1  0  0
        1  0  1  0
        2  0  0  1
        3  1  0  0

        >>> s1 = pd.Categorical(['a', 'b', np.nan])

        >>> s1.get_dummies()
           a  b
        0  1  0
        1  0  1
        2  0  0

        >>> s1.get_dummies(dummy_na=True)
           a  b  NaN
        0  1  0    0
        1  0  1    0
        2  0  0    1

        >>> pd.Categorical(list('abcaa')).get_dummies()
           a  b  c
        0  1  0  0
        1  0  1  0
        2  0  0  1
        3  1  0  0
        4  1  0  0

        >>> pd.Categorical(list('abcaa')).get_dummies(drop_first=True)
           b  c
        0  0  0
        1  1  0
        2  0  1
        3  0  0
        4  0  0

        >>> pd.Categorical(list('abc')).get_dummies(dtype=float)
             a    b    c
        0  1.0  0.0  0.0
        1  0.0  1.0  0.0
        2  0.0  0.0  1.0
        """
        from pandas import get_dummies

        return get_dummies(
            self,
            prefix=prefix,
            prefix_sep=prefix_sep,
            dummy_na=dummy_na,
            sparse=sparse,
            drop_first=drop_first,
            dtype=dtype,
        )
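
Since the method above only forwards its arguments, the ``drop_first`` and ``dummy_na`` options can be tried today through the top-level ``pandas.get_dummies``. The sketch below assumes nothing beyond that existing function; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(list("abcaa"))

# drop_first=True removes the first category, leaving k-1 indicator columns.
reduced = pd.get_dummies(cat, drop_first=True)

# dummy_na=True adds one extra column that flags missing values.
with_na = pd.get_dummies(pd.Categorical(["a", "b", np.nan]), dummy_na=True)

print(list(reduced.columns))
print(with_na.shape)
```

With ``drop_first=True`` a row of all zeros stands for the dropped first category, so that encoding can no longer distinguish it from a missing value.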

    @property
    def dtype(self) -> CategoricalDtype:
        """
1 change: 1 addition & 0 deletions pandas/core/reshape/reshape.py
@@ -768,6 +768,7 @@ def get_dummies(
    See Also
    --------
    Series.str.get_dummies : Convert Series to dummy codes.
    Categorical.get_dummies : Convert a Categorical array to dummy codes.

    Examples
    --------
32 changes: 31 additions & 1 deletion pandas/tests/arrays/categorical/test_api.py
@@ -3,7 +3,7 @@
import numpy as np
import pytest

from pandas import Categorical, CategoricalIndex, DataFrame, Index, Series
from pandas import Categorical, CategoricalIndex, DataFrame, Index, Series, get_dummies
import pandas._testing as tm
from pandas.core.arrays.categorical import recode_for_categories
from pandas.tests.arrays.categorical.common import TestCategorical
@@ -399,6 +399,36 @@ def test_remove_unused_categories(self):
        out = cat.remove_unused_categories()
        assert out.tolist() == val.tolist()

    @pytest.mark.parametrize(
        "vals",
        [
            ["a", "b", "b", "a"],
            ["a", "b", "b", "a", np.nan],
            [1, 1.5, "a", (1, "b")],
            [1, 1.5, "a", (1, "b"), np.nan],
        ],
    )
    def test_get_dummies(self, vals):
        # GH 8745
        cats = Categorical(Series(vals))
        tm.assert_equal(cats.get_dummies(), get_dummies(cats))

    @pytest.mark.parametrize(
        "vals",
        [
            ["a", "b", "b", "a"],
            ["a", "b", "b", "a", np.nan],
            [1, 1.5, "a", (1, "b")],
            [1, 1.5, "a", (1, "b"), np.nan],
        ],
    )
    def test_dummies_roundtrip(self, vals):
        # GH 8745
        cats = Categorical(Series(vals))
        dummies = cats.get_dummies()
        cats2 = Categorical.from_dummies(dummies)
        tm.assert_equal(cats, cats2)
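
The round trip exercised by the test above can be approximated with existing public calls. The ``roundtrip`` function below is a hypothetical check written for illustration: it decodes with NumPy instead of the proposed ``Categorical.from_dummies``, and it reuses the original categories rather than reading them from the column labels.

```python
import numpy as np
import pandas as pd


def roundtrip(values):
    """Hypothetical check: Categorical -> dummies -> Categorical."""
    cat = pd.Categorical(values)
    flags = pd.get_dummies(cat).to_numpy().astype(bool)
    # Missing values produce an all-False row, which maps back to code -1.
    codes = np.where(flags.any(axis=1), flags.argmax(axis=1), -1)
    restored = pd.Categorical.from_codes(codes, categories=cat.categories)
    return cat, restored


cat, restored = roundtrip(["a", "b", "b", "a", np.nan])
print(list(cat.codes) == list(restored.codes))
```

Comparing the integer codes sidesteps the fact that NaN does not compare equal to itself.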


class TestCategoricalAPIWithFactor(TestCategorical):
    def test_describe(self):