DOC: Combine concat/ merge sections for categoricals
ivirshup committed Oct 7, 2019
1 parent 1b9bb12 commit 9fb6d67
Showing 1 changed file with 32 additions and 59 deletions.
91 changes: 32 additions & 59 deletions doc/source/user_guide/categorical.rst
@@ -797,34 +797,47 @@ Assigning a ``Categorical`` to parts of a column of other types will use the val
     df.dtypes
 .. _categorical.merge:
+.. _categorical.concat:

-Merging
-~~~~~~~
+Merging / Concatenation
+~~~~~~~~~~~~~~~~~~~~~~~

-You can concat two ``DataFrames`` containing categorical data together,
-but the categories of these categoricals need to be the same:
+By default, combining ``Series`` or ``DataFrames`` which contain the same
+categories results in ``category`` dtype, otherwise results will depend on the
+dtype of the underlying categories. Merges that result in non-categorical
+dtypes will likely have higher memory usage. Use ``.astype`` or
+``union_categoricals`` to ensure ``category`` results.

 .. ipython:: python
-    cat = pd.Series(["a", "b"], dtype="category")
-    vals = [1, 2]
-    df = pd.DataFrame({"cats": cat, "vals": vals})
-    res = pd.concat([df, df])
-    res
-    res.dtypes
+    from pandas.api.types import union_categoricals
-If the categories are not exactly the same, merging will coerce the
-categoricals to their categories' dtypes:
+    # same categories
+    s1 = pd.Series(['a', 'b'], dtype='category')
+    s2 = pd.Series(['a', 'b', 'a'], dtype='category')
+    pd.concat([s1, s2])
+    # different categories
+    s3 = pd.Series(['b', 'c'], dtype='category')
+    pd.concat([s1, s3])
+    pd.concat([s1, s3]).astype('category')
+    union_categoricals([s1.array, s3.array])
-.. ipython:: python
-    df_different = df.copy()
-    df_different["cats"].cat.categories = ["c", "d"]
-    res = pd.concat([df, df_different])
-    res
-    res.dtypes
+Following table summarizes the results of ``Categoricals`` related combinations.

-The same applies to ``df.append(df_different)``.
++----------+--------------------------------------------------------+----------------------------+
+| arg1 | arg2 | result |
++==========+========================================================+============================+
+| category | category (identical categories) | category |
++----------+--------------------------------------------------------+----------------------------+
+| category | category (different categories, both not ordered) | object (dtype is inferred) |
++----------+--------------------------------------------------------+----------------------------+
+| category | category (different categories, either one is ordered) | object (dtype is inferred) |
++----------+--------------------------------------------------------+----------------------------+
+| category | not category | object (dtype is inferred) |
++----------+--------------------------------------------------------+----------------------------+

 See also the section on :ref:`merge dtypes<merging.dtypes>` for notes about preserving merge dtypes and performance.
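
The merge half of the combined section has no example of its own, so here is a minimal sketch of the coercion it describes. The frames ``left`` and ``right``, their column names, and the ``shared`` dtype are hypothetical and chosen only for illustration. It assumes a pandas version that exposes ``pd.CategoricalDtype`` at the top level.

```python
import pandas as pd

# Hypothetical example frames; column names are illustrative only.
left = pd.DataFrame({"key": pd.Series(["a", "b"], dtype="category"), "x": [1, 2]})
right = pd.DataFrame({"key": pd.Series(["b", "c"], dtype="category"), "y": [3, 4]})

# The key categories differ (["a", "b"] vs ["b", "c"]), so the merged key
# is coerced to the dtype of the categories instead of staying categorical.
coerced = pd.merge(left, right, on="key")
print(coerced["key"].dtype)

# Casting both keys to one shared CategoricalDtype before merging keeps
# the key column categorical in the result.
shared = pd.CategoricalDtype(categories=["a", "b", "c"])
left_shared = left.assign(key=left["key"].astype(shared))
right_shared = right.assign(key=right["key"].astype(shared))
kept = pd.merge(left_shared, right_shared, on="key")
print(kept["key"].dtype)
```

Casting up front is one way to follow the ``.astype`` advice in the text. Whether it is worth the extra step depends on how much memory the non-categorical key would cost.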

@@ -920,46 +933,6 @@ the resulting array will always be a plain ``Categorical``:
     # "b" is coded to 0 throughout, same as c1, different from c2
     c.codes
-.. _categorical.concat:
-
-Concatenation
-~~~~~~~~~~~~~
-
-This section describes concatenations specific to ``category`` dtype. See :ref:`Concatenating objects<merging.concat>` for general description.
-
-By default, ``Series`` or ``DataFrame`` concatenation which contains the same categories
-results in ``category`` dtype, otherwise results in ``object`` dtype.
-Use ``.astype`` or ``union_categoricals`` to get ``category`` result.
-
-.. ipython:: python
-    # same categories
-    s1 = pd.Series(['a', 'b'], dtype='category')
-    s2 = pd.Series(['a', 'b', 'a'], dtype='category')
-    pd.concat([s1, s2])
-    # different categories
-    s3 = pd.Series(['b', 'c'], dtype='category')
-    pd.concat([s1, s3])
-    pd.concat([s1, s3]).astype('category')
-    union_categoricals([s1.array, s3.array])
-Following table summarizes the results of ``Categoricals`` related concatenations.
-
-+----------+--------------------------------------------------------+----------------------------+
-| arg1 | arg2 | result |
-+==========+========================================================+============================+
-| category | category (identical categories) | category |
-+----------+--------------------------------------------------------+----------------------------+
-| category | category (different categories, both not ordered) | object (dtype is inferred) |
-+----------+--------------------------------------------------------+----------------------------+
-| category | category (different categories, either one is ordered) | object (dtype is inferred) |
-+----------+--------------------------------------------------------+----------------------------+
-| category | not category | object (dtype is inferred) |
-+----------+--------------------------------------------------------+----------------------------+
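
The ``ipython`` example covers only the first two rows of this table. The sketch below checks all four rows by testing whether each ``pd.concat`` result keeps ``category`` dtype. The helper ``is_cat`` and the series names are hypothetical, introduced only for this illustration. The exact non-categorical dtype is deliberately not asserted, since the table notes that it is inferred.

```python
import pandas as pd
from pandas.api.types import union_categoricals


def is_cat(obj):
    """Return True when a Series or array has ``category`` dtype."""
    return isinstance(obj.dtype, pd.CategoricalDtype)


base = pd.Series(["a", "b"], dtype="category")
identical = pd.Series(["a", "b", "a"], dtype="category")
different = pd.Series(["b", "c"], dtype="category")
ordered = pd.Series(["b", "c"], dtype=pd.CategoricalDtype(["b", "c"], ordered=True))
plain = pd.Series(["x", "y"])  # not categorical

# One entry per table row: does the concatenated result stay categorical?
results = {
    "identical categories": is_cat(pd.concat([base, identical])),
    "different, both unordered": is_cat(pd.concat([base, different])),
    "different, one ordered": is_cat(pd.concat([base, ordered])),
    "not category": is_cat(pd.concat([base, plain])),
}
for case, keeps_category in results.items():
    print(f"{case}: category dtype kept = {keeps_category}")

# union_categoricals combines the categories instead of dropping the dtype.
combined = union_categoricals([base.array, different.array])
print(list(combined.categories))
```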

Getting data in/out
-------------------