Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Index.append behaviour with categoricals #14586

Closed
jorisvandenbossche opened this issue Nov 4, 2016 · 1 comment · Fixed by #38098
Closed

API: Index.append behaviour with categoricals #14586

jorisvandenbossche opened this issue Nov 4, 2016 · 1 comment · Fixed by #38098
Labels
API Design Categorical Categorical Data Type Enhancement Index Related to the Index class or subclasses Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Follow-up of #14545.

We had a long discussion on what the behaviour of concat should be when you have categorical data: #13767. In the end, for 0.19.0, we changed the behaviour of raising an error when categories didn't match to returning object dtyped data (only data with identical categories and ordered attributed gives a categorical as result). The table below is a summary of the changes between 0.18.1 and 0.19.0:

For categorical Series:

left right append/concat 0.18 append/concat 0.19.0
category category (identical categories) category category
category category (different categories) error object
category not category category object
category not category (different categories) category with NaNs object

However, we didn't change behaviour of append for Indexes (the above append is for series):

For CategoricalIndex:

left right append 0.18 append 0.19.0 append 0.19.1
category category (identical categories) category category category
category category (different categories) error error error
category not category category category category
category not category (with other values) error error error
not category category (with other values) object error object

The last line, i.e. the case where the calling Index is not a CategoricalIndex, changed by accident in 0.19.0, and it is this that I corrected for in PR #14545 for 0.19.1.

Questions:

  • Do we want the same behaviour for Index.append as we now have for Series.append with categorical data? This means that the column in the table above becomes 'object' apart from the first row.
  • Do we want to make an exception for the case where the values in the 'right' correspond with the categories? (so that pd.CategoricalIndex(['a', 'b', 'c']).append(pd.Index(['a']))keeps working)

Changing this to always return object dtype unless for categoricals with indentical categories is easy, but gives a few failures in our test suite. Namely, in some indexing tests (indexing a DataFrame with a CategoricalIndex) there are changes in behaviour because indexing with a non-existing value in the index was performed using CategoricalIndex.append(). But this we can workaround in the indexing code of course.

cc @JanSchulz @sinhrks

@jorisvandenbossche jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves API Design Categorical Categorical Data Type labels Nov 4, 2016
@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Nov 4, 2016
@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@toobaz toobaz added Index Related to the Index class or subclasses and removed Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 28, 2019
@mroeschke mroeschke added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 5, 2020
@jbrockmendel
Copy link
Member

I'm running into this in a couple of places, mainly in setitem-with-expansion operations:

cat = pd.Categorical(["a", "b", "c", "d", "e"], ordered=True)

ci = pd.CategoricalIndex(cat[:-1])   # <-- "e" is an unused category

df = pd.DataFrame(np.random.randn(3, 4), columns=ci)

>>> df["e"] = 0
>>> is_dtype_equal(df.columns.dtype, cat.dtype)
True

>>> df["f"] = 1
TypeError: 'fill_value=f' is not present in this Categorical's categories

xref #37774 im inclined to cast rather than raise here. One option that this could look like is:

     def _concat(self, to_concat: List["Index"], name: Label) -> "CategoricalIndex":
         # if calling index is category, don't check dtype of others
-        codes = np.concatenate([self._is_dtype_compat(c).codes for c in to_concat])
-        cat = self._data._from_backing_data(codes)
-        return type(self)._simple_new(cat, name=name)
+        try:
+            codes = np.concatenate([self._is_dtype_compat(c).codes for c in to_concat])
+        except TypeError:
+            # not all to_concat elements are among our categories (or NA)
+            from pandas.core.dtypes.concat import concat_compat
+
+            res = concat_compat(to_concat)
+            return Index(res, name=name)
+        else:
+            cat = self._data._from_backing_data(codes)
+            return type(self)._simple_new(cat, name=name)

(another option would be to work this behavior into concat_compat; i havent thought about this much. I guess i do like this idea on the principle of having every concat-like operation behave the same way)

Making this change would also allow us to get rid of a couple of try/excepts in reshape.pivot, and get rid of _to_safe_for_reshape.

@jreback jreback modified the milestones: Contributions Welcome, 1.2 Nov 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Categorical Categorical Data Type Enhancement Index Related to the Index class or subclasses Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants