Categorical.(get|from)_dummies #34426

clbarnes · 2020-05-28T10:40:18Z

Simply converting categorical variables to and from dummy variables.

closes API/ENH: from_dummies #8745
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Intentionally smaller-scoped than #31795 (and indeed get_dummies) as a broadly useful MVP which can be chained with other basic functionality. The tests are fairly rudimentary and I welcome any edge cases which should be picked out.

For discussion

from_dummies

class method rather than free function: Keeps categorical-related functionality together, reduces surface in the global namespace, more obvious what is produced.
silently drop column with NA header: wasn't sure about this. Maybe it should raise a warning?
No handling of masked dataframes or dataframes with NA values
No subsetting or renaming of columns: callers can do this themselves

to_dummies

Name: I went for Categorical.to_dummies instead of matching the free function get_dummies. The symmetry of to/from aids understanding, and using get_ might imply A) that something is being got, which it isn't, or B) signature/feature parity with the existing function, which wasn't a design goal for me.
to_dummies return type: cls.to_dummies returns bools, where get_dummies returns uint8s by default, which doesn't make a lot of sense to me as we are representing boolean data (and they're the same in memory anyway). Dummy variables are generally used for regression where a continuous variable is required, so ints don't get us any closer to what we may want, and being able to index into the categories using the row may be valuable.
No dtype argument: I didn't see any benefit of cls.to_dummies(dtype=float) over cls.to_dummies().astype(float). The latter is more explicit, no slower, and minimises API surface.
No prefix, prefix_sep arguments: these unnecessarily assume string column headers. If someone wants to rename their columns, they can: IMO it's not a core requirement of this method.
dummy_na replaced with na_column: if we're including the argument, we may as well let the user decide what they want to call their column, using the dtype they prefer (e.g. "other", -1 etc). They can always supply np.nan if they want get_dummies-like behaviour.
No sparse argument: This I regret. Producing a sparse array would be valuable. However, it would drastically complicate the method, so I left it out for the MVP. The caller can always sparsify it after it's produced, so long as RAM isn't an issue for the temporary df.
No drop_first argument: If someone wants to drop one of their columns, they can (cls.to_dummies().drop(columns="my_col")): again, not a core requirement of this method. Broadly speaking, I'm not in favour of adding arguments to save the caller <=1 line of their own code unless there are e.g. speed gains.
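
To make the trade-off in the bullets above concrete, here is a minimal sketch that uses the existing pandas.get_dummies as a stand-in. The proposed Categorical.to_dummies comes from this PR and is not assumed to exist, so it is not called. The sketch compares the dtype and drop_first arguments with their chained one-line equivalents.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "a", "c"])

# dtype argument vs. an explicit cast after the fact
via_arg = pd.get_dummies(cat, dtype=float)
via_cast = pd.get_dummies(cat).astype(float)

# drop_first argument vs. dropping the first category's column by hand
dropped_arg = pd.get_dummies(cat, drop_first=True)
dropped_manual = pd.get_dummies(cat).drop(columns="a")

same_values = bool((via_arg.to_numpy() == via_cast.to_numpy()).all())
same_columns = list(dropped_arg.columns) == list(dropped_manual.columns)
```

In both cases the chained form gives the same values and column labels as the keyword argument, which is the basis for the "minimise API surface" argument.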

MarcoGorelli

Cool, thanks @clbarnes !

MarcoGorelli · 2020-05-28T12:36:57Z

pandas/core/arrays/categorical.py

@@ -379,6 +382,110 @@ def __init__(
        self._dtype = self._dtype.update_dtype(dtype)
        self._codes = coerce_indexer_dtype(codes, dtype.categories)

+    @classmethod
+    def from_dummies(cls, dummies: "DataFrame", ordered=None):


Can you annotate ordered, as well as the return type?

Is there a preference for how to annotate the self return type? As I understand it, you can do it with a TypeVar, with a string, or by using from __future__ import annotations (although only in 3.7+).
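
For reference, a minimal sketch of the TypeVar option mentioned above. MiniCategorical is a hypothetical stand-in class, not the pandas implementation; it only illustrates how an alternate constructor can be annotated so that subclasses get their own type back.

```python
from typing import Optional, Sequence, Type, TypeVar

# Bound TypeVar: "some subclass of MiniCategorical" (hypothetical class).
CatT = TypeVar("CatT", bound="MiniCategorical")


class MiniCategorical:
    def __init__(
        self,
        codes: Sequence[int],
        categories: Sequence[str],
        ordered: Optional[bool] = None,
    ) -> None:
        self.codes = list(codes)
        self.categories = list(categories)
        self.ordered = ordered

    @classmethod
    def from_codes(
        cls: Type[CatT],
        codes: Sequence[int],
        categories: Sequence[str],
        ordered: Optional[bool] = None,
    ) -> CatT:
        # Returning cls(...) means a subclass call yields a subclass instance.
        return cls(codes, categories, ordered)


class Sub(MiniCategorical):
    pass


obj = Sub.from_codes([0, 1], ["a", "b"])
```

With a plain string annotation such as -> "MiniCategorical", a type checker would report the result of Sub.from_codes as the base class rather than Sub; the TypeVar form keeps the subclass type.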

MarcoGorelli · 2020-05-28T12:43:20Z

pandas/core/arrays/categorical.py

+        codes = (df.astype(int) * mult_by).sum(axis=1) - 1
+        codes[codes.isna()] = -1


I may have misunderstood what's happening here, but would it work to use pd.factorize?

I may be missing what factorize does, but I don't think it would save any effort here - factorize wants a 1D array (which this isn't until codes is instantiated), and then doesn't produce a Categorical unless given one, so the last line wouldn't change either.

OK yes, I was thinking something like

codes, _ = factorize(df.idxmax(axis=1))

, but then it'd be necessary to deal with the case when there's a row without any ones, which'd require more code and would be no shorter/clearer than what you've already got. So, ignore the factorize suggestion :)

Couple of things then:

is astype(int) necessary? It should work without it

perhaps there could be a comment explaining why you do -1 at the end of this line

codes = (df.astype(int) * mult_by).sum(axis=1) - 1

?
Now that I've run the code, it's obvious what it does, but it wasn't when I was just reading it

You're right, the astype wasn't necessary. There's a comment now, best way I could think of explaining it but I can go for prose if you prefer.
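
A small worked example of the line under discussion may help readers who have not run it. This is a sketch on made-up data; mult_by is assumed to be the 1-based column positions, which is consistent with the comment diagram quoted later in the thread.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0]],
    columns=["a", "b", "c"],
)

# Weight column i by i + 1, so a row with a single 1 in column i sums to i + 1.
mult_by = np.arange(1, df.shape[1] + 1)

# Subtracting 1 turns the sum into a zero-based code. A row of all zeros
# sums to 0 and becomes -1, which from_codes treats as "no category".
codes = (df * mult_by).sum(axis=1) - 1

cat = pd.Categorical.from_codes(codes.to_numpy(), categories=df.columns)
```

Here codes comes out as 1, 2, 0, -1, so the resulting Categorical holds b, c, a and one missing value.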

jreback

will look soon

jreback · 2020-05-28T23:04:19Z

likely would want a doc example of both of these in or near https://pandas.pydata.org/docs/dev/user_guide/reshaping.html#computing-indicator-dummy-variables and/or links to categorical.rst

TomAugspurger · 2020-05-29T13:44:51Z

pandas/core/arrays/categorical.py

+        --------
+        :func:`pandas.get_dummies`
+        """
+        from pandas import DataFrame, CategoricalIndex, Series


Why is this not using the existing get_dummies implementation? This should be identical to pd.get_dummies(Categorical), right?

TomAugspurger · 2020-05-29T13:46:16Z

pandas/core/arrays/categorical.py

+        codes[codes.isna()] = -1
+        return cls.from_codes(codes, df.columns.values, ordered=ordered)
+
+    def to_dummies(self, na_column=None) -> "DataFrame":


I think this should be called get_dummies. I don't think the justification for deviating from that name is strong enough to warrant a different name.

I also think it should exactly match the signature of get_dummies.

TomAugspurger · 2020-05-29T13:47:31Z

pandas/core/arrays/categorical.py

+        A column whose header is NA will be dropped;
+        any row with a NA value will be uncategorised.


Need examples showing these.

TomAugspurger · 2020-05-29T13:47:46Z

pandas/core/arrays/categorical.py

+
+        Parameters
+        ----------
+        dummies : DataFrame of bool-like


Just the type on the first line. The structure can go on the second line.

TomAugspurger · 2020-05-29T13:49:21Z

pandas/tests/arrays/categorical/test_constructors.py

+    def test_from_dummies_gt1(self):
+        # GH 8745
+        dummies = DataFrame([[1, 0, 1], [0, 1, 0], [0, 0, 1]], columns=["a", "b", "c"])
+        with pytest.raises(ValueError):


Make sure we're raising an informative error message here.
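
One way to act on this request is to pin the message with the match argument of pytest.raises. The sketch below uses a hypothetical validator, check_at_most_one_hot, in place of the PR's from_dummies; the wording of the error message is an assumption, not the text used in the PR.

```python
import pandas as pd
import pytest


def check_at_most_one_hot(dummies: pd.DataFrame) -> None:
    """Hypothetical validator: reject rows set in more than one column."""
    row_totals = dummies.sum(axis=1)
    bad_rows = row_totals.index[row_totals > 1].tolist()
    if bad_rows:
        raise ValueError(
            f"{len(bad_rows)} row(s) belong to more than one category: {bad_rows}"
        )


def test_from_dummies_gt1():
    # Row 0 has ones in both "a" and "c".
    dummies = pd.DataFrame(
        [[1, 0, 1], [0, 1, 0], [0, 0, 1]], columns=["a", "b", "c"]
    )
    # match is a regular expression searched for in the error message.
    with pytest.raises(ValueError, match="more than one category"):
        check_at_most_one_hot(dummies)
```

Without match, the test would also pass for an unrelated ValueError raised somewhere else in the call.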

TomAugspurger · 2020-05-29T13:49:48Z

pandas/tests/arrays/categorical/test_constructors.py

+
+    def test_from_dummies_nan(self):
+        # GH 8745
+        raw = ["a", "a", "b", "c", "c", "a", np.nan]


Is nan the only supported NA value that's dropped?

Good point, I'll change the selection of columns to drop to be dummies.columns[dummies.columns.isna()], and update parametrize this test over [np.nan, pd.NA, pd.NaT, None].
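
A minimal sketch of the column selection described in the reply above, looped over the same four NA values. The data is made up, and the object-dtype index is an assumption chosen so that each NA value is stored as given.

```python
import numpy as np
import pandas as pd

kept_columns = []
dropped_counts = []

for na in [np.nan, pd.NA, pd.NaT, None]:
    columns = pd.Index(["a", "b", na], dtype=object)
    dummies = pd.DataFrame([[1, 0, 0], [0, 1, 0], [0, 0, 1]], columns=columns)

    # Index.isna() flags every kind of missing label in an object index.
    is_na = dummies.columns.isna()
    na_cols = dummies.columns[is_na]

    # Select by boolean mask rather than by label, so no NA lookup is needed.
    cleaned = dummies.loc[:, ~is_na]

    dropped_counts.append(len(na_cols))
    kept_columns.append(list(cleaned.columns))
```

Each iteration drops exactly one column and keeps a and b, whichever NA value is used as the header.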

TomAugspurger · 2020-05-29T13:50:15Z

pandas/tests/arrays/categorical/test_constructors.py

@@ -643,3 +645,55 @@ def test_constructor_string_and_tuples(self):
        c = pd.Categorical(np.array(["c", ("a", "b"), ("b", "a"), "c"], dtype=object))
        expected_index = pd.Index([("a", "b"), ("b", "a"), "c"])
        assert c.categories.equals(expected_index)
+
+    def test_from_dummies(self):


Can you parametrize this over sparse = True/False?

For the sake of simplicity I have not made an attempt to support loading from a sparse array but I can give it a go!

clbarnes

Thanks for your comments @TomAugspurger ! I've addressed most of them but think the to_dummies and get_dummies shared functionality is probably worth discussion on the main thread.

clbarnes · 2020-05-29T14:03:30Z

pandas/tests/arrays/categorical/test_constructors.py

+
+    def test_from_dummies_nan(self):
+        # GH 8745
+        raw = ["a", "a", "b", "c", "c", "a", np.nan]


Good point, I'll change the selection of columns to drop to be dummies.columns[dummies.columns.isna()], and update parametrize this test over [np.nan, pd.NA, pd.NaT, None].

clbarnes · 2020-05-29T14:18:00Z

pandas/tests/arrays/categorical/test_constructors.py

@@ -643,3 +645,55 @@ def test_constructor_string_and_tuples(self):
        c = pd.Categorical(np.array(["c", ("a", "b"), ("b", "a"), "c"], dtype=object))
        expected_index = pd.Index([("a", "b"), ("b", "a"), "c"])
        assert c.categories.equals(expected_index)
+
+    def test_from_dummies(self):


For the sake of simplicity I have not made an attempt to support loading from a sparse array but I can give it a go!

clbarnes · 2020-05-29T14:18:51Z

pandas/tests/arrays/categorical/test_constructors.py

+    def test_from_dummies_gt1(self):
+        # GH 8745
+        dummies = DataFrame([[1, 0, 1], [0, 1, 0], [0, 0, 1]], columns=["a", "b", "c"])
+        with pytest.raises(ValueError):


clbarnes · 2020-05-29T15:21:27Z

I think this should be called get_dummies. I don't think the justification for deviating from that name is strong enough to warrant a different name.

I also think it should exactly match the signature of get_dummies.

@TomAugspurger My initial implementation was a thin wrapper around get_dummies but given how simple the "raw" implementation is I didn't see a huge benefit to it. I could switch back, although I had wondered whether it might be better for get_dummies to actually call to_dummies in some cases to avoid some logic.

I do feel more strongly about keeping the name and API different to get_dummies, though. get_ methods imply retrieving something from the object rather than converting it to a different form. from_ is IMO the most obvious name for the alternate constructor, and if from_ exists, the opposite ought to be to_ (as_ or into_ would also be OK but I feel they imply more mutation/ destructive operations). I personally have a minor nit with the naming of pandas' IO methods on this basis - read_csv and to_csv are from two different opposite pairs, which increased the time it took for me to remember them.

My justifications for not supporting all of the get_dummies interface are in the PR description. Supporting the whole get_dummies API in to_dummies would necessitate supporting all of the reverse operations in from_dummies (which is the key part of this PR; to_dummies is more of an afterthought), which I considered to be a large complexity overhead for not a lot of gain in utility. If we're not supporting the whole API, they shouldn't have the same name.

Would be interested for other parties to weigh in.

TomAugspurger · 2020-05-29T20:50:09Z

I do feel more strongly about keeping the name and API different to get_dummies

Yeah, your reasons for to_dummies all make sense. But in my mind it comes down to a values judgement between

1. Symmetry between the two methods, which favors to_dummies over get_dummies, as to_dummies aligns better with from_dummies
2. Consistency with the established name get_dummies.

In my opinion, 2 wins out.

Supporting the whole get_dummies API in to_dummies would necessitate supporting all of the reverse operations in from_dummies

That would indeed be helpful. Which parameters / behaviors are especially hard to support? I'm happy to raise a NotImplementedError for things that could be done in a followup PR if anyone is interested.

jreback · 2020-06-01T01:25:36Z

doc/source/user_guide/categorical.rst

+Some operations, like regression and classification,
+encodes a single categorical variable as a column for each category,
+with each row having False in all but one column (True).
+These are called dummy variables, or one-hot encoding.


can you add a link to a wiki page (or scikit-learn is ok too)

jreback · 2020-06-01T01:27:11Z

doc/source/user_guide/categorical.rst

+
+The :meth:`pandas.Categorical.from_dummies` class method accepts a dataframe
+whose dtypes are coercible to boolean, and an ``ordered`` argument
+for whether the resulting ``Categorical`` should be considered ordered


you need to have some good cross links to the current get_dummies section

otherwise this is very confusing

i would prefer that these are actually in the get_dummies with just a small note here

jreback · 2020-06-01T01:28:56Z

pandas/core/arrays/categorical.py

@@ -379,6 +382,137 @@ def __init__(
        self._dtype = self._dtype.update_dtype(dtype)
        self._codes = coerce_indexer_dtype(codes, dtype.categories)

+    @classmethod


i agree with the others

either remove this or make it identical to get_dummies

I think the discussion around this was more about the necessity/ implementation of to_dummies - there currently isn't an equivalent of from_dummies in pandas, which is what originally precipitated the issue. But note taken re. to_dummies if that was the intention.

jreback · 2020-06-01T01:29:36Z

pandas/core/arrays/categorical.py

+        #  010            020    2    1
+        #  001 * 1,2,3 => 003 -> 3 -> 2 = correct codes
+        #  100            100    1    0
+        codes = ((df * mult_by).sum(axis=1, skipna=False) - 1).astype("Int64")


this is actually a memory heavy impl

Sure, I guess the alternative would be something like

codes = pd.Series(np.full(len(df), np.nan), dtype="Int64")
row_totals = df.sum(axis=1, skipna=False)
...  # multicat_rows check
codes[row_totals == 0] = -1
row_idx, code = np.nonzero(row_totals > 0)
codes[row_idx] = code
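
A runnable variant of this idea, as a sketch on made-up data rather than the code that ended up in the PR. It assumes the dummies contain no NA values, and it applies np.nonzero to the 2-D dummy matrix so that both row and column positions come back. The point is to avoid building the weighted intermediate frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0]],
    columns=["a", "b", "c"],
)

values = df.to_numpy(dtype=bool)

# Rows set in more than one column cannot map to a single category.
row_totals = values.sum(axis=1)
if (row_totals > 1).any():
    raise ValueError("Some rows belong to more than one category")

# Start with every row uncategorised (-1), then fill in the rows that
# have a set column with that column's position.
codes = np.full(len(df), -1, dtype=np.intp)
row_idx, col_idx = np.nonzero(values)
codes[row_idx] = col_idx

cat = pd.Categorical.from_codes(codes, categories=df.columns)
```

The only per-row arrays allocated beyond the boolean matrix are row_totals, codes, and the two index arrays from np.nonzero.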

dsaxton · 2020-09-16T01:03:58Z

@clbarnes Is this still active? Can you fix merge conflicts if so?

clbarnes · 2020-09-17T09:17:13Z

Sorry I haven't been able to get back to this, I'll see what I can do today.

The tl;dr of the changes requested seems to be

Dummy creation class method should be called get_dummies
- This should be a thin wrapper around the get_dummies function and have an identical signature
Better cross-references between categorical and get_dummies docs
See above docstring requests
Rebase

clbarnes · 2020-09-17T13:05:10Z

I'm calling pandas.get_dummies in Categorical.get_dummies. I'd rather be calling pandas.core.reshape.reshape._get_dummies_1d, but that fails a lint, and I don't want to start renaming existing functions - any guidance on that?
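
For readers following along, the thin-wrapper approach might look roughly like the sketch below. categorical_get_dummies is a hypothetical free function standing in for the proposed method; the keyword arguments mirror those of pandas.get_dummies, minus columns, which only applies to DataFrame input.

```python
import pandas as pd


def categorical_get_dummies(
    cat: pd.Categorical,
    prefix=None,
    prefix_sep="_",
    dummy_na=False,
    sparse=False,
    drop_first=False,
    dtype=None,
) -> pd.DataFrame:
    """Hypothetical stand-in: forward everything to pandas.get_dummies."""
    return pd.get_dummies(
        cat,
        prefix=prefix,
        prefix_sep=prefix_sep,
        dummy_na=dummy_na,
        sparse=sparse,
        drop_first=drop_first,
        dtype=dtype,
    )


plain = categorical_get_dummies(pd.Categorical(["a", "b", "a"]))
prefixed = categorical_get_dummies(pd.Categorical(["a", "b", "a"]), prefix="x")
```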

clbarnes · 2020-09-17T15:08:11Z

Still to do:

Edge cases for np.nonzero with "boolean" dtype in Categorical.from_dummies
More tests, generally
Check cohesiveness of docs

pep8speaks · 2020-09-17T15:10:51Z

Hello @clbarnes! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-22 15:22:56 UTC

jreback · 2020-09-20T02:16:22Z

pandas/core/arrays/categorical.py

+        1  0.0  1.0  0.0
+        2  0.0  0.0  1.0
+        """
+        # Would be better to use pandas.core.reshape.reshape._get_dummies_1d


don't leave comments like this. why cant you use _get_dummies_1d ?

Importing a function with a leading underscore fails a lint. It's one of the home-rolled lints, so it can't be ignored with # noqa.

jreback · 2020-09-20T02:18:59Z

pandas/core/arrays/categorical.py

+        if fillna is not None:
+            df = df.fillna(fillna, inplace=copied)
+
+        row_totals = df.sum(axis=1, skipna=False)


why skipna?

If there is no explicit fillna policy given, and there are still NA values in the data, I'd prefer to raise an error rather than silently pass bunk data through. Therefore nans should not be skipped in this step so that they can be checked for in the next line.
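
The difference is easy to see on a tiny frame. This is a sketch with made-up data; the fillna(0) policy at the end is just one example of an explicit choice a caller could make.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 0.0], [np.nan, 1.0]], columns=["a", "b"])

# Default skipna=True: the NaN counts as 0, so row 1 looks like a clean "b".
skipped = df.sum(axis=1)

# skipna=False: the NaN propagates, so row 1 sums to NaN and can be detected.
strict = df.sum(axis=1, skipna=False)
has_unhandled_na = bool(strict.isna().any())

# With an explicit fill policy applied first, nothing is left to detect.
filled_totals = df.fillna(0).sum(axis=1, skipna=False)
```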

jreback · 2020-09-20T02:19:06Z

pandas/core/arrays/categorical.py

+
+        row_totals = df.sum(axis=1, skipna=False)
+        if row_totals.isna().any():
+            raise ValueError("Unhandled NA values in dummy array")


is this tested?

There are some holes left in the tests, on the to do list

github-actions · 2020-10-23T00:17:00Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

arw2019 · 2020-11-06T16:33:44Z

@clbarnes how's this going? If you're interested in continuing can you merge master & resolve conflicts and address comments?

clbarnes · 2020-11-06T16:35:56Z

Sure, will do - did the first 90% but things got busy before I could tackle the last 90%. I'll try to get to this next week.

arw2019 · 2020-11-06T16:43:30Z

Sure, will do - did the first 90% but things got busy before I could tackle the last 90%. I'll try to get to this next week.

Sounds good (& no rush ofc). Ping us when you're ready for the next round of reviews

github-actions · 2020-12-07T00:14:12Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jreback · 2021-02-11T01:35:59Z

@clbarnes still active? I think this was pretty close.

mroeschke · 2021-03-21T00:30:25Z

Thanks for the PR, but it appears to have gotten stale. Going to close for now but let us know if you can merge master and update and we'd be happy to reopen.

MarcoGorelli requested changes May 28, 2020

MarcoGorelli reviewed May 28, 2020

jreback requested changes May 28, 2020

jreback added Categorical Categorical Data Type Enhancement labels May 28, 2020

clbarnes force-pushed the 8745-from_dummies branch 2 times, most recently from 52b6900 to 5ae8517 Compare May 29, 2020 13:21

TomAugspurger reviewed May 29, 2020

clbarnes commented May 29, 2020

jreback requested changes Jun 1, 2020

dsaxton mentioned this pull request Sep 16, 2020

CI: Add stale PR action #36336

Merged

dsaxton added the Stale label Sep 16, 2020

clbarnes force-pushed the 8745-from_dummies branch from b857b9b to 3f9573c Compare September 17, 2020 09:38

clbarnes changed the title ~~Categorical.(to|from)_dummies~~ Categorical.(get|from)_dummies Sep 17, 2020

github-actions bot removed the Stale label Sep 20, 2020

jreback requested changes Sep 20, 2020

clbarnes added 6 commits September 22, 2020 16:04

Categorical (to|from)_dummies methods

b5ab7f2

Simplistic implementation to go between dummy variables and Categoricals.

Tests: Categorical.(to|from)_dummies

f937c96

Add reference to Categorical.to_dummies to get_dummies

dd14132

whatsnew: add issue number to Categorical.(to|from)_dummies

9dc9da5

Review comments for dummies tests

ac9cec2

Review comments for dummies implementation

0459cb1

clbarnes added 16 commits September 22, 2020 16:04

categorical.test_api: to->get dummies

6e6ddda

isort pandas_web

9fcebf0

fix typos in categorical doctests

b9908c4

isort test_datetime

faeec41

use get_dummies instead of _get_dummies_1d

e11f28e

Reference get_dummies/ from_dummies in reshaping docs

742c940

use prefix in from_dummies

722137d

document prefix handling in categorical.rst

4945ba8

Lower-memory impl for Categorical.from_dummies

1f98233

remove comment about use of _get_dummies_1d

ff01048

type-annotate get/from_dummies

604b839

split overlong line

c71e807

blacken

6f9272a

use f-strings

8fd4b72

add some typing

534bc33

remove unnecessary .values

0facec6

clbarnes force-pushed the 8745-from_dummies branch from 765f001 to 0facec6 Compare September 22, 2020 15:22

github-actions bot added the Stale label Oct 23, 2020

arw2019 removed the Stale label Nov 6, 2020

github-actions bot added the Stale label Dec 7, 2020

MarcoGorelli mentioned this pull request Feb 12, 2021

STYLE use force-grid-wrap in isort #39780

Merged

1 task

mroeschke closed this Mar 21, 2021

pckSF mentioned this pull request Jun 9, 2021

Initial draft: from_dummies #41902

Merged

10 tasks
