Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Categorical.(get|from)_dummies #34426

Closed
wants to merge 47 commits into from

Conversation

clbarnes
Copy link
Contributor

Simply converting categorical variables to and from dummy variables.

  • closes API/ENH: from_dummies #8745
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Intentionally smaller-scoped than #31795 (and indeed get_dummies) as a broadly useful MVP which can be chained with other basic functionality. The tests are fairly rudimentary and I welcome any edge cases which should be picked out.

For discussion

from_dummies

  • class method rather than free function: Keeps categorical-related functionality together, reduces surface in the global namespace, more obvious what is produced.
  • silently drop column with NA header: wasn't sure about this. Maybe it should raise a warning?
  • No handling of masked dataframes or dataframes with NA values
  • No subsetting or renaming of columns: callers can do this themselves

to_dummies

  • Name: I went for Categorical.to_dummies instead of matching the free function get_dummies. The symmetry of to/from aids understanding, and using get_ might imply A) that something is being got, which it isn't or B) signature/ feature parity with the existing method, which wasn't a design goal for me.
  • to_dummies return type: cls.to_dummies returns bools, where get_dummies returns uint8s by default, which doesn't make a lot of sense to me as we are representing boolean data (and they're the same in memory anyway). Dummy variables are generally used for regression where a continuous variable is required, so ints don't get us any closer to what we may want, and being able to index into the categories using the row may be valuable.
  • No dtype argument: I didn't see any benefit of cls.to_dummies(dtype=float) over cls.to_dummies().astype(float). The latter is more explicit, no slower, and minimises API surface.
  • No prefix, prefix_sep arguments: these unnecessarily assume string column headers. If someone wants to rename their columns, they can: IMO it's not a core requirement of this method.
  • dumma_na replaced with na_column: if we're including the argument, we may as well let the user decide what they want to call their column, using the dtype they prefer (e.g. "other", -1 etc). They can always supply np.nan if they want get_dummies-like behaviour.
  • No sparse argument: This I regret. Producing a sparse array would be valuable. However, it would drastically complicate the method, so I left it out for the MVP. The caller can always sparsify it after it's produced, so long as RAM isn't an issue for the temporary df.
  • No drop_first argument: If someone wants to drop one of their columns, they can (cls.to_dummies().drop(columns="my_col")): again, not a core requirement of this method. Broadly speaking, I'm not in favour of adding arguments to save the caller <=1 line of their own code unless there are e.g. speed gains.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks @clbarnes !

pandas/core/arrays/categorical.py Show resolved Hide resolved
@@ -379,6 +382,110 @@ def __init__(
self._dtype = self._dtype.update_dtype(dtype)
self._codes = coerce_indexer_dtype(codes, dtype.categories)

@classmethod
def from_dummies(cls, dummies: "DataFrame", ordered=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you annotate ordered, as well as the return type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a preference for how to annotate the self return type? As I understand it, you can do it with a TypeVar, with a string, or by using from __future__ import annotations (although only in 3.7+).

pandas/core/arrays/categorical.py Outdated Show resolved Hide resolved
Comment on lines 431 to 432
codes = (df.astype(int) * mult_by).sum(axis=1) - 1
codes[codes.isna()] = -1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have misunderstood what's happening here, but would it work to use pd.factorize?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing what factorize does, but I don't think it would save any effort here - factorize wants a 1D array (which this isn't until codes is instantiated), and then doesn't produce a Categorical unless given one, so the last line wouldn't change either.

Copy link
Member

@MarcoGorelli MarcoGorelli May 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK yes, I was thinking something like

codes, _ = factorize(df.idxmax(axis=1))

, but then it'd be necessary to deal with the case when there's a row without any ones, which'd require more code and would be no shorter/clearer than what you've already got. So, ignore the factorize suggestion :)

Could of things then:

  • is astype(int) necessary? It should work without it
  • perhaps there could be a comment explaining why you do -1 at the end of this line
        codes = (df.astype(int) * mult_by).sum(axis=1) - 1

?
Now that I've run the code, it's obvious what it does, but it wasn't when I was just reading it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the astype wasn't necessary. There's a comment now, best way I could think of explaining it but I can go for prose if you prefer.

pandas/tests/arrays/categorical/test_api.py Outdated Show resolved Hide resolved
pandas/core/arrays/categorical.py Outdated Show resolved Hide resolved
doc/source/whatsnew/v1.1.0.rst Outdated Show resolved Hide resolved
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will look soon

@jreback jreback added Categorical Categorical Data Type Enhancement labels May 28, 2020
@jreback
Copy link
Contributor

jreback commented May 28, 2020

likely would want a doc-example of both of these near in https://pandas.pydata.org/docs/dev/user_guide/reshaping.html#computing-indicator-dummy-variables and/or links to categorical.rst

@clbarnes clbarnes force-pushed the 8745-from_dummies branch 2 times, most recently from 52b6900 to 5ae8517 Compare May 29, 2020 13:21
--------
:func:`pandas.get_dummies`
"""
from pandas import DataFrame, CategoricalIndex, Series
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not using the existing get_dummies implementation? This shold be identical to pd.get_dummies(Categorical), right?

codes[codes.isna()] = -1
return cls.from_codes(codes, df.columns.values, ordered=ordered)

def to_dummies(self, na_column=None) -> "DataFrame":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be called get_dummies I don't think the justification for deviating from that name is strong enough to warrant a different name.

I also think it should exactly match the signature of get_dummies.

Comment on lines 394 to 395
A column whose header is NA will be dropped;
any row with a NA value will be uncategorised.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need examples showing these.


Parameters
----------
dummies : DataFrame of bool-like
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the type on the first line. The structure can go on the second line.

pandas/core/arrays/categorical.py Show resolved Hide resolved
def test_from_dummies_gt1(self):
# GH 8745
dummies = DataFrame([[1, 0, 1], [0, 1, 0], [0, 0, 1]], columns=["a", "b", "c"])
with pytest.raises(ValueError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure we're raising an informative error message here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do


def test_from_dummies_nan(self):
# GH 8745
raw = ["a", "a", "b", "c", "c", "a", np.nan]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is nan the only supported NA value that's dropped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll change the selection of columns to drop to be dummies.columns[dummies.columns.isna()], and update parametrize this test over [np.nan, pd.NA, pd.NaT, None].

@@ -643,3 +645,55 @@ def test_constructor_string_and_tuples(self):
c = pd.Categorical(np.array(["c", ("a", "b"), ("b", "a"), "c"], dtype=object))
expected_index = pd.Index([("a", "b"), ("b", "a"), "c"])
assert c.categories.equals(expected_index)

def test_from_dummies(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you parametrize this over sparse = True/False?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of simplicity I have not made an attempt to support loading from a sparse array but I can give it a go!

Copy link
Contributor Author

@clbarnes clbarnes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments @TomAugspurger ! I've addressed most of them but think the to_dummies and get_dummies shared functionality is probably worth discussion on the main thread.


def test_from_dummies_nan(self):
# GH 8745
raw = ["a", "a", "b", "c", "c", "a", np.nan]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll change the selection of columns to drop to be dummies.columns[dummies.columns.isna()], and update parametrize this test over [np.nan, pd.NA, pd.NaT, None].

@@ -643,3 +645,55 @@ def test_constructor_string_and_tuples(self):
c = pd.Categorical(np.array(["c", ("a", "b"), ("b", "a"), "c"], dtype=object))
expected_index = pd.Index([("a", "b"), ("b", "a"), "c"])
assert c.categories.equals(expected_index)

def test_from_dummies(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of simplicity I have not made an attempt to support loading from a sparse array but I can give it a go!

def test_from_dummies_gt1(self):
# GH 8745
dummies = DataFrame([[1, 0, 1], [0, 1, 0], [0, 0, 1]], columns=["a", "b", "c"])
with pytest.raises(ValueError):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

@clbarnes
Copy link
Contributor Author

clbarnes commented May 29, 2020

I think this should be called get_dummies I don't think the justification for deviating from that name is strong enough to warrant a different name.

I also think it should exactly match the signature of get_dummies.

@TomAugspurger My initial implementation was a thin wrapper around get_dummies but given how simple the "raw" implementation is I didn't see a huge benefit to it. I could switch back, although I had wondered whether it might be better for for get_dummies to actually call to_dummies in some cases to avoid some logic.

I do feel more strongly about keeping the name and API different to get_dummies, though. get_ methods imply retrieving something from the object rather than converting it to a different form. from_ is IMO the most obvious name for the alternate constructor, and if from_ exists, the opposite ought to be to_ (as_ or into_ would also be OK but I feel they imply more mutation/ destructive operations). I personally have a minor nit with the naming of pandas' IO methods on this basis - read_csv and to_csv are from two different opposite pairs, which increased the time it took for me to remember them.

My justifications for not supporting all of the get_dummies interface are in the PR description. Supporting the whole get_dummies API in to_dummies would necessitate supporting all of the reverse operations in from_dummies (which is the key part of this PR; to_dummies is more of an afterthought), which I considered to be a large complexity overhead for not a lot of gain in utility. If we're not supporting the whole API, they shouldn't have the same name.

Would be interested for other parties to weigh in.

@TomAugspurger
Copy link
Contributor

I do feel more strongly about keeping the name and API different to get_dummies

Yeah, your reasons for to_dummies all make sense. But in my mind it comes down to a values judgement between

  1. Symmetry the two methods, which favors to_dummies over get_dummies, as to_dummies aligns better with from_dummies
  2. Consistency with the established name get_dummies.

In my opinion, 2 wins out.

Supporting the whole get_dummies API in to_dummies would necessitate supporting all of the reverse operations in from_dummies

That would indeed be helpful. Which parameters / behaviors are especially hard to support? I'm happy to raise a NotImplementedError for things that could be done in a followup PR if anyone is interested.

Some operations, like regression and classification,
encodes a single categorical variable as a column for each category,
with each row having False in all but one column (True).
These are called dummy variables, or one-hot encoding.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add a link to a wiki page (or scikit lean ok too)


The :meth:`pandas.Categorical.from_dummies` class method accepts a dataframe
whose dtypes are coercible to boolean, and an ``ordered`` argument
for whether the resulting ``Categorical`` should be considered ordered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to have some good cross links to the current get_dummies section

otherwise this is very confusing

i would prefer that these are actually in the get_dummies with just a small note here

@@ -379,6 +382,137 @@ def __init__(
self._dtype = self._dtype.update_dtype(dtype)
self._codes = coerce_indexer_dtype(codes, dtype.categories)

@classmethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree with the others

either remove this or make it identical to get_dummies

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the discussion around this was more about the necessity/ implementation of to_dummies - there currently isn't an equivalent of from_dummies in pandas, which is what originally precipitated the issue. But note taken re. to_dummies if that was the intention.

# 010 020 2 1
# 001 * 1,2,3 => 003 -> 3 -> 2 = correct codes
# 100 100 1 0
codes = ((df * mult_by).sum(axis=1, skipna=False) - 1).astype("Int64")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually a memory heavy impl

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I guess the alternative would be something like

codes = pd.Series(np.full(len(df), np.nan), dtype="Int64")

row_totals = df.sum(axis=1, skipna=False)
... # multicat_rows check

codes[row_totals == 0] = -1
row_idx, code = np.nonzero(row_totals > 0)
codes[row_idx] = code

@dsaxton
Copy link
Member

dsaxton commented Sep 16, 2020

@clbarnes Is this still active? Can you fix merge conflicts if so?

@clbarnes
Copy link
Contributor Author

clbarnes commented Sep 17, 2020

Sorry I haven't been able to get back to this, I'll see what I can do today.

The tl;dr of the changes requested seems to be

  • Dummy creation class method should be called get_dummies
    • This should be a thin wrapper around the get_dummies function and have an identical signature
  • Better cross-references between categorical and get_dummies docs
  • See above docstring requests
  • Rebase

@clbarnes
Copy link
Contributor Author

I'm calling pandas.get_dummies in Categorical.get_dummies. I'd rather be calling pandas.core.reshape.reshape._get_dummies_1d, but that fails a lint, and I don't want to start renaming existing functions - any guidance on that?

@clbarnes clbarnes changed the title Categorical.(to|from)_dummies Categorical.(get|from)_dummies Sep 17, 2020
@clbarnes
Copy link
Contributor Author

Still to do:

  • Edge cases for np.nonzero with "boolean" dtype in Categorical.from_dummies
  • More tests, generally
  • Check cohesiveness of docs

@pep8speaks
Copy link

pep8speaks commented Sep 17, 2020

Hello @clbarnes! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-22 15:22:56 UTC

@github-actions github-actions bot removed the Stale label Sep 20, 2020
1 0.0 1.0 0.0
2 0.0 0.0 1.0
"""
# Would be better to use pandas.core.reshape.reshape._get_dummies_1d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't leave comments like this. why cant you use _get_dummies_1d ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Importing a function with a leading underscore fails a lint. It's one of the home-rolled lints, so it can't be ignored with # noqa.

pandas/core/arrays/categorical.py Show resolved Hide resolved
pandas/core/arrays/categorical.py Outdated Show resolved Hide resolved
pandas/core/arrays/categorical.py Outdated Show resolved Hide resolved
if fillna is not None:
df = df.fillna(fillna, inplace=copied)

row_totals = df.sum(axis=1, skipna=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why skipna?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no explicit fillna policy given, and there are still NA values in the data, I'd prefer to raise an error rather than silently pass bunk data through. Therefore nans should not be skipped in this step so that they can be checked for in the next line.


row_totals = df.sum(axis=1, skipna=False)
if row_totals.isna().any():
raise ValueError("Unhandled NA values in dummy array")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this tested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some holes left in the tests, on the to do list

pandas/core/arrays/categorical.py Outdated Show resolved Hide resolved
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 23, 2020
@arw2019
Copy link
Member

arw2019 commented Nov 6, 2020

@clbarnes how's this going? If you're interested in continuing can you merge master & resolve conflicts and address comments?

@clbarnes
Copy link
Contributor Author

clbarnes commented Nov 6, 2020

Sure, will do - did the first 90% but things got busy before I could tackle the last 90%. I'll try to get to this next week.

@arw2019
Copy link
Member

arw2019 commented Nov 6, 2020

Sure, will do - did the first 90% but things got busy before I could tackle the last 90%. I'll try to get to this next week.

Sounds good (& no rush ofc). Ping us when you're ready for the next round of reviews

@arw2019 arw2019 removed the Stale label Nov 6, 2020
@github-actions
Copy link
Contributor

github-actions bot commented Dec 7, 2020

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 7, 2020
@jreback
Copy link
Contributor

jreback commented Feb 11, 2021

@clbarnes still active? I think this was pretty close.

@mroeschke
Copy link
Member

Thanks for the PR, but it appears to have gotten stale. Going to close for now but let us know if you can merge master and update and we'd be happy to reopen.

@mroeschke mroeschke closed this Mar 21, 2021
@pckSF pckSF mentioned this pull request Jun 9, 2021
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API/ENH: from_dummies
8 participants