DOC: Fix docs on merging categoricals. #28185

Merged
merged 4 commits into from
Nov 8, 2019
81 changes: 22 additions & 59 deletions doc/source/user_guide/categorical.rst
@@ -797,34 +797,37 @@ Assigning a ``Categorical`` to parts of a column of other types will use the val
df.dtypes

.. _categorical.merge:
+.. _categorical.concat:

-Merging
-~~~~~~~
+Merging / Concatenation
+~~~~~~~~~~~~~~~~~~~~~~~

-You can concat two ``DataFrames`` containing categorical data together,
-but the categories of these categoricals need to be the same:
+By default, combining ``Series`` or ``DataFrames`` which contain the same
+categories results in ``category`` dtype, otherwise results will depend on the
+dtype of the underlying categories. Merges that result in non-categorical
+dtypes will likely have higher memory usage. Use ``.astype`` or
+``union_categoricals`` to ensure ``category`` results.

.. ipython:: python

-    cat = pd.Series(["a", "b"], dtype="category")
-    vals = [1, 2]
-    df = pd.DataFrame({"cats": cat, "vals": vals})
-    res = pd.concat([df, df])
-    res
-    res.dtypes
+    from pandas.api.types import union_categoricals

-In this case the categories are not the same, and therefore an error is raised:
+    # same categories
+    s1 = pd.Series(['a', 'b'], dtype='category')
+    s2 = pd.Series(['a', 'b', 'a'], dtype='category')
+    pd.concat([s1, s2])

-.. ipython:: python
+    # different categories
+    s3 = pd.Series(['b', 'c'], dtype='category')
+    pd.concat([s1, s3])

-    df_different = df.copy()
-    df_different["cats"].cat.categories = ["c", "d"]
-    try:
-        pd.concat([df, df_different])
-    except ValueError as e:
-        print("ValueError:", str(e))
+    # Output dtype is inferred based on categories values
+    int_cats = pd.Series([1, 2], dtype="category")
+    float_cats = pd.Series([3.0, 4.0], dtype="category")
+    pd.concat([int_cats, float_cats])

-The same applies to ``df.append(df_different)``.
+    pd.concat([s1, s3]).astype('category')
+    union_categoricals([s1.array, s3.array])

See also the section on :ref:`merge dtypes<merging.dtypes>` for notes about preserving merge dtypes and performance.

@@ -920,46 +923,6 @@ the resulting array will always be a plain ``Categorical``:
# "b" is coded to 0 throughout, same as c1, different from c2
c.codes

-.. _categorical.concat:

-Concatenation
-~~~~~~~~~~~~~

-This section describes concatenations specific to ``category`` dtype. See :ref:`Concatenating objects<merging.concat>` for general description.

-By default, ``Series`` or ``DataFrame`` concatenation which contains the same categories
-results in ``category`` dtype, otherwise results in ``object`` dtype.
-Use ``.astype`` or ``union_categoricals`` to get ``category`` result.

-.. ipython:: python

-    # same categories
-    s1 = pd.Series(['a', 'b'], dtype='category')
-    s2 = pd.Series(['a', 'b', 'a'], dtype='category')
-    pd.concat([s1, s2])

-    # different categories
-    s3 = pd.Series(['b', 'c'], dtype='category')
-    pd.concat([s1, s3])

-    pd.concat([s1, s3]).astype('category')
-    union_categoricals([s1.array, s3.array])


-Following table summarizes the results of ``Categoricals`` related concatenations.
Member

Can you keep this table? I think it's a good summary of what happens.

Member

Though please update it to reflect the current status.

Contributor Author

I think it would be useful if it said a bit more about the return types, though it might be better to just point to the type promotion docs for that. How about something like this?

+-------------------+-------------------+-----------+----------------------------+
| arg1              | arg2              | identical | result                     |
+===================+===================+===========+============================+
| category          | category          | True      | category                   |
+-------------------+-------------------+-----------+----------------------------+
| category (object) | category (object) | False     | object (dtype is inferred) |
+-------------------+-------------------+-----------+----------------------------+
| category (int)    | category (float)  | False     | float (dtype is inferred)  |
+-------------------+-------------------+-----------+----------------------------+

Member

Seems logical. Not sure if we have better verbiage than saying category (object) to refer to the categories of the categorical; @TomAugspurger might have thoughts

Contributor Author

I think I'd like to keep it close to the repr, since that should make it easier to relate to practice

>>> pd.Categorical(list("abc"))                                                                 
[a, b, c]
Categories (3, object): [a, b, c]
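
For readers following the ``category (object)`` wording: the type shown in the ``Categories (...)`` line of the repr is the dtype of the categories index, which can be read from ``.categories.dtype`` (or ``.cat.categories.dtype`` on a ``Series``). A minimal sketch; the variable names are illustrative only and do not appear in the PR:

```python
import pandas as pd

# Illustrative names; only the pandas calls themselves are real API.
letters = pd.Categorical(list("abc"))
ints = pd.Series([1, 2], dtype="category")
floats = pd.Series([3.0, 4.0], dtype="category")

# The categorical itself always has ``category`` dtype ...
assert isinstance(ints.dtype, pd.CategoricalDtype)

# ... while the type named in the repr is the dtype of its categories.
letters_categories_dtype = letters.categories.dtype
ints_categories_dtype = ints.cat.categories.dtype      # int64
floats_categories_dtype = floats.cat.categories.dtype  # float64

print(letters_categories_dtype, ints_categories_dtype, floats_categories_dtype)
```

This is the dtype that the table's ``result`` column falls back to when the two categoricals do not share an identical dtype.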

Contributor Author

Sorry for the delay @WillAyd, I thought we were waiting on @TomAugspurger. Just pushed an update. Hopefully I got the table right.
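
The rows of the summary table discussed above can be spot-checked directly. The following is a rough sketch, assuming a pandas version that behaves as this PR documents; the helper ``is_categorical`` and the variable names other than ``s1``, ``s2``, ``s3``, ``int_cats`` and ``float_cats`` are hypothetical and not part of pandas or the PR:

```python
import pandas as pd
from pandas.api.types import union_categoricals


def is_categorical(obj):
    """Hypothetical helper: True if a Series/array has ``category`` dtype."""
    return isinstance(obj.dtype, pd.CategoricalDtype)


s1 = pd.Series(["a", "b"], dtype="category")
s2 = pd.Series(["a", "b", "a"], dtype="category")
s3 = pd.Series(["b", "c"], dtype="category")
int_cats = pd.Series([1, 2], dtype="category")
float_cats = pd.Series([3.0, 4.0], dtype="category")

same = pd.concat([s1, s2])                 # identical categories
different = pd.concat([s1, s3])            # different categories
numeric = pd.concat([int_cats, float_cats])  # int vs float categories
mixed = pd.concat([s1, pd.Series(["x", "y"])])  # category + not category

# Two ways to get ``category`` back when the categories differ.
recast = different.astype("category")
recovered = union_categoricals([s1, s3])

for name, result in [("same", same), ("different", different),
                     ("numeric", numeric), ("mixed", mixed)]:
    print(name, result.dtype)
```

Comparing ``different.dtype`` with ``s1.cat.categories.dtype`` rather than hard-coding ``object`` keeps the check aligned with the PR's wording that the result follows the categories' dtype.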


-+----------+--------------------------------------------------------+----------------------------+
-| arg1     | arg2                                                   | result                     |
-+==========+========================================================+============================+
-| category | category (identical categories)                        | category                   |
-+----------+--------------------------------------------------------+----------------------------+
-| category | category (different categories, both not ordered)      | object (dtype is inferred) |
-+----------+--------------------------------------------------------+----------------------------+
-| category | category (different categories, either one is ordered) | object (dtype is inferred) |
-+----------+--------------------------------------------------------+----------------------------+
-| category | not category                                           | object (dtype is inferred) |
-+----------+--------------------------------------------------------+----------------------------+


Getting data in/out
-------------------
2 changes: 1 addition & 1 deletion doc/source/user_guide/merging.rst
@@ -883,7 +883,7 @@ The merged result:
.. note::

The category dtypes must be *exactly* the same, meaning the same categories and the ordered attribute.
-Otherwise the result will coerce to ``object`` dtype.
+Otherwise the result will coerce to the categories' dtype.

.. note::

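
The reworded note in ``merging.rst`` can be illustrated with a small merge. This is a sketch only, assuming merge keys behave as the note describes; the frame and column names (``left``, ``right_same``, ``right_diff``, ``lval``, ``rval``) are made up for the example:

```python
import pandas as pd

left = pd.DataFrame(
    {"key": pd.Series(["a", "b"], dtype="category"), "lval": [1, 2]}
)
right_same = pd.DataFrame(
    {"key": pd.Series(["a", "b"], dtype="category"), "rval": [3, 4]}
)
right_diff = pd.DataFrame(
    {"key": pd.Series(["b", "c"], dtype="category"), "rval": [5, 6]}
)

# Identical category dtypes on both sides: the key stays ``category``.
merged_same = pd.merge(left, right_same, on="key")

# Different categories: the key falls back to a non-categorical dtype.
merged_diff = pd.merge(left, right_diff, on="key")

# Casting both keys to one shared dtype beforehand keeps ``category``.
shared = pd.CategoricalDtype(["a", "b", "c"])
merged_shared = pd.merge(
    left.astype({"key": shared}), right_diff.astype({"key": shared}), on="key"
)

print(merged_same["key"].dtype, merged_diff["key"].dtype, merged_shared["key"].dtype)
```

Building the shared dtype from the union of both category sets avoids turning values into ``NaN``, which would happen if one side were cast to the other side's narrower categories.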