DOC: Fix docs on merging categoricals. (#28185)
ivirshup authored and WillAyd committed Nov 8, 2019
1 parent 62f6a42 commit 3b58f48
Showing 2 changed files with 36 additions and 61 deletions.
95 changes: 35 additions & 60 deletions doc/source/user_guide/categorical.rst
@@ -797,37 +797,52 @@ Assigning a ``Categorical`` to parts of a column of other types will use the val
df.dtypes
.. _categorical.merge:
.. _categorical.concat:

Merging
~~~~~~~
Merging / Concatenation
~~~~~~~~~~~~~~~~~~~~~~~

You can concat two ``DataFrames`` containing categorical data together,
but the categories of these categoricals need to be the same:
By default, combining ``Series`` or ``DataFrames`` which contain the same
categories results in ``category`` dtype, otherwise results will depend on the
dtype of the underlying categories. Merges that result in non-categorical
dtypes will likely have higher memory usage. Use ``.astype`` or
``union_categoricals`` to ensure ``category`` results.

.. ipython:: python
cat = pd.Series(["a", "b"], dtype="category")
vals = [1, 2]
df = pd.DataFrame({"cats": cat, "vals": vals})
res = pd.concat([df, df])
res
res.dtypes
from pandas.api.types import union_categoricals
In this case the categories are not the same, and therefore an error is raised:
# same categories
s1 = pd.Series(['a', 'b'], dtype='category')
s2 = pd.Series(['a', 'b', 'a'], dtype='category')
pd.concat([s1, s2])
.. ipython:: python
# different categories
s3 = pd.Series(['b', 'c'], dtype='category')
pd.concat([s1, s3])
df_different = df.copy()
df_different["cats"].cat.categories = ["c", "d"]
try:
    pd.concat([df, df_different])
except ValueError as e:
    print("ValueError:", str(e))
# Output dtype is inferred based on categories values
int_cats = pd.Series([1, 2], dtype="category")
float_cats = pd.Series([3.0, 4.0], dtype="category")
pd.concat([int_cats, float_cats])
pd.concat([s1, s3]).astype('category')
union_categoricals([s1.array, s3.array])
The same applies to ``df.append(df_different)``.
The following table summarizes the results of merging ``Categoricals``:

See also the section on :ref:`merge dtypes<merging.dtypes>` for notes about preserving merge dtypes and performance.
+-------------------+------------------------+----------------------+-----------------------------+
| arg1              | arg2                   | identical            | result                      |
+===================+========================+======================+=============================+
| category          | category               | True                 | category                    |
+-------------------+------------------------+----------------------+-----------------------------+
| category (object) | category (object)      | False                | object (dtype is inferred)  |
+-------------------+------------------------+----------------------+-----------------------------+
| category (int)    | category (float)       | False                | float (dtype is inferred)   |
+-------------------+------------------------+----------------------+-----------------------------+

See also the section on :ref:`merge dtypes<merging.dtypes>` for notes about
preserving merge dtypes and performance.
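
The rows of the new table can be checked with a short standalone script. This is a minimal sketch rather than part of the commit: it assumes a reasonably recent pandas, and the variable names (``same_a``, ``other``, ``kept``, and so on) are illustrative only.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two Series with identical categories, and one with different categories.
same_a = pd.Series(["a", "b"], dtype="category")
same_b = pd.Series(["a", "b", "a"], dtype="category")
other = pd.Series(["b", "c"], dtype="category")

# Identical categories: the result keeps the category dtype.
kept = pd.concat([same_a, same_b], ignore_index=True)

# Different categories: the result falls back to the dtype of the categories.
lost = pd.concat([same_a, other], ignore_index=True)

# Integer and float categories: the result dtype is inferred from the values.
int_cats = pd.Series([1, 2], dtype="category")
float_cats = pd.Series([3.0, 4.0], dtype="category")
numeric = pd.concat([int_cats, float_cats], ignore_index=True)

# Two ways to get a categorical result back.
restored = lost.astype("category")
unioned = union_categoricals([same_a.array, other.array])

for name, obj in [("kept", kept), ("lost", lost), ("numeric", numeric),
                  ("restored", restored), ("unioned", unioned)]:
    print(name, obj.dtype)
```

``astype`` converts after the fact, so the intermediate non-categorical result is still built; ``union_categoricals`` combines the categoricals directly and returns a ``Categorical`` rather than a ``Series``.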

.. _categorical.union:

@@ -918,46 +933,6 @@ the resulting array will always be a plain ``Categorical``:
# "b" is coded to 0 throughout, same as c1, different from c2
c.codes
.. _categorical.concat:

Concatenation
~~~~~~~~~~~~~

This section describes concatenations specific to ``category`` dtype. See :ref:`Concatenating objects<merging.concat>` for general description.

By default, ``Series`` or ``DataFrame`` concatenation which contains the same categories
results in ``category`` dtype, otherwise results in ``object`` dtype.
Use ``.astype`` or ``union_categoricals`` to get ``category`` result.

.. ipython:: python
# same categories
s1 = pd.Series(['a', 'b'], dtype='category')
s2 = pd.Series(['a', 'b', 'a'], dtype='category')
pd.concat([s1, s2])
# different categories
s3 = pd.Series(['b', 'c'], dtype='category')
pd.concat([s1, s3])
pd.concat([s1, s3]).astype('category')
union_categoricals([s1.array, s3.array])
Following table summarizes the results of ``Categoricals`` related concatenations.

+----------+--------------------------------------------------------+----------------------------+
| arg1     | arg2                                                   | result                     |
+==========+========================================================+============================+
| category | category (identical categories)                        | category                   |
+----------+--------------------------------------------------------+----------------------------+
| category | category (different categories, both not ordered)      | object (dtype is inferred) |
+----------+--------------------------------------------------------+----------------------------+
| category | category (different categories, either one is ordered) | object (dtype is inferred) |
+----------+--------------------------------------------------------+----------------------------+
| category | not category                                           | object (dtype is inferred) |
+----------+--------------------------------------------------------+----------------------------+

Getting data in/out
-------------------
2 changes: 1 addition & 1 deletion doc/source/user_guide/merging.rst
@@ -881,7 +881,7 @@ The merged result:
.. note::

The category dtypes must be *exactly* the same, meaning the same categories and the ordered attribute.
Otherwise the result will coerce to ``object`` dtype.
Otherwise the result will coerce to the categories' dtype.

.. note::
