PERF: fix #32976 slow group by for categorical columns #33739

rtlee9 · 2020-04-23T04:05:52Z

Aggregate categorical codes with the fast Cython aggregation path for select `how` operations. Added a new ASV benchmark, copied from #32976, that shows a > 99% performance improvement for this case.

closes Categorical columns are slow in groupby operations #32976
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

rtlee9 · 2020-04-23T04:19:49Z

pandas/core/groupby/generic.py

-                result, _ = self.grouper.aggregate(
-                    block.values, how, axis=1, min_count=min_count
+
+                cat_method_blacklist = (


I copied this list from asv_bench.benchmarks.groupby.method_blacklist['object'] and appended the "add" method. Is there a better way of blacklisting these methods which shouldn't be applied to categorical codes?

tests.groupby.test_function.test_arg_passthru is an example of a test failure without this blacklisting.

pandas/core/groupby/generic.py

TomAugspurger

@jbrockmendel how close do you think we are to defining an API to allow passing EA values to cython arrays?

# ignore the method names.
values = self._values_for_cython(method="first")
# ... operation on values
result = self._result_from_cython(..., dtype=self.dtype, method="first")

For categorical, that method returns codes and result_from_cython is from_codes.
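
A minimal sketch of the round trip described above. The two helpers are hypothetical free functions named after the proposed methods (they are not existing pandas API), and plain NumPy stands in for the Cython kernel:

```python
import numpy as np
import pandas as pd


def values_for_cython(cat: pd.Categorical) -> np.ndarray:
    """Hypothetical helper: expose the integer codes for a fast aggregation."""
    return np.asarray(cat.codes)


def result_from_cython(codes: np.ndarray, dtype: pd.CategoricalDtype) -> pd.Categorical:
    """Hypothetical helper: rebuild a Categorical from aggregated codes."""
    return pd.Categorical.from_codes(codes, dtype=dtype)


cat = pd.Categorical(["b", "a", "c", "a"], categories=["a", "b", "c"])
groups = np.array([0, 0, 1, 1])

codes = values_for_cython(cat)
# "first" per group, computed on the codes only (NumPy stands in for the Cython kernel).
first_idx = [np.flatnonzero(groups == g)[0] for g in np.unique(groups)]
first_codes = codes[first_idx]

# Map the aggregated codes back to the original categorical dtype.
result = result_from_cython(first_codes, cat.dtype)
```

The aggregation never touches the category values themselves, which is why position-based operations such as "first" are safe on codes.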

TomAugspurger · 2020-04-23T16:03:29Z

pandas/core/groupby/generic.py

+                        categories=block.values.categories,
+                        ordered=block.values.ordered,
+                    )


Suggested change: replace

categories=block.values.categories,
ordered=block.values.ordered,
)

with

dtype=block.values.dtype
)

I actually tried this and found that it raises a ValueError here: Cannot specify categories or ordered together with dtype
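
For reference, a small sketch of the constraint behind that error, assuming the call in question is `Categorical.from_codes` (the thread does not show the full call): a dtype alone carries both the categories and the ordered flag, while combining it with explicit `categories` or `ordered` is rejected.

```python
import pandas as pd

dtype = pd.CategoricalDtype(categories=["a", "b", "c"], ordered=True)
codes = [2, 0, 1]

# Passing only the dtype carries both the categories and the ordered flag.
rebuilt = pd.Categorical.from_codes(codes, dtype=dtype)

# Passing categories/ordered together with dtype is rejected with a ValueError.
try:
    pd.Categorical.from_codes(
        codes, categories=dtype.categories, ordered=dtype.ordered, dtype=dtype
    )
except ValueError as err:
    message = str(err)
else:
    message = None
```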

doc/source/whatsnew/v1.1.0.rst

jbrockmendel · 2020-04-23T16:21:50Z

how close do you think we are to defining an API to allow passing EA values to cython arrays?
values = self._values_for_cython(method="first")

If we restrict attention to ordering-based cython methods then i think this is pretty straightforward, just need to get everyone on board.

jreback · 2020-04-25T21:52:19Z

pandas/core/groupby/ops.py

@@ -472,6 +473,29 @@ def _cython_operation(

        is_datetimelike = needs_i8_conversion(values.dtype)
        is_numeric = is_numeric_dtype(values.dtype)
+        is_categorical = is_categorical_dtype(values)
+        cat_method_blacklist = (
+            "add",


why do we need a blacklist at all? you are already only operating on the codes

If this method is passed categorical values with a "how" like mean, then we don't want to average the codes as if they were ints. This would happen in the pandas.tests.groupby.test_function.test_arg_passthru test without the blacklist, for example. The logic for these blacklisted cases is already handled higher in the call stack, in pandas.core.groupby.generic.DataFrameGroupBy._cython_agg_blocks, but maybe we could move some of that logic here for conciseness and/or perf later on.

i don't think you need this at all, what breaks if you take it out entirely (the blacklist)

This test will fail without the blacklist: pandas.tests.groupby.test_function.test_arg_passthru. Here's a minimal example that will fail:

import pandas as pd

df = pd.DataFrame(
    {"group": [1, 1, 2], "category_string": pd.Series(list("abc")).astype("category")},
    columns=["group", "category_string"],
)
df.groupby("group").mean(numeric_only=False)

Nothing else fails so perhaps this test can be updated in lieu of the blacklist, not sure if this is fine from an interface perspective.

yeah i think its ok to simply update this test

It really seems like we need this blacklist: anytime the cython op does something with the actual values, we can't pass the codes there. For example, something like mean or sum on a numeric categorical.

I see. ok then, let's move this to an attribute on the class, or a function maybe better for easier reuse

Moved to class level in 79f0c72. We could easily make this accessible outside the class with an additional method w/ property decorator
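
To make the concern in this thread concrete, here is a small sketch (with made-up values) of what goes wrong when a value-based reduction such as mean is applied to the codes of a numeric categorical:

```python
import numpy as np
import pandas as pd

# A numeric categorical: codes are positions in `categories`, not the values.
cat = pd.Categorical([10, 1000], categories=[10, 100, 1000])

codes = np.asarray(cat.codes)             # positions 0 and 2
mean_of_codes = codes.mean()              # 1.0
mean_of_values = np.asarray(cat).mean()   # 505.0

# Decoding the averaged code yields a category that is not the mean at all.
decoded = cat.categories[int(mean_of_codes)]  # 100
```

Averaging the codes gives 1.0, which decodes to the category 100, while the mean of the underlying values is 505.0.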

jreback · 2020-04-25T21:52:48Z

pandas/tests/groupby/aggregate/test_aggregate.py

@@ -466,7 +466,7 @@ def test_agg_cython_category_not_implemented_fallback():
    result = df.groupby("col_num").col_cat.first()
    expected = pd.Series(
        [1, 2, 3], index=pd.Index([1, 2, 3], name="col_num"), name="col_cat"
-    )
+    ).astype("category")


why is this changed?

The _cython_operation method in this commit returns a categorical if passed a categorical. In this test the "col_cat" column is categorical so it seems intuitive to me that when aggregated over it should also be a category.

The fact that this test was passing before suggests that this is not also true in the higher order functions like DataFrameGroupBy._cython_agg_blocks.

This change needs a dedicated whatsnew entry.

Agreed that preserving the dtype is probably the right choice, but this is a relatively large change.

Sure, just added a new entry in the "Other API changes" section in commit 6498d6b

jreback · 2020-04-25T21:53:09Z

pls show the performance before / after.

jreback · 2020-04-25T21:53:57Z

asv_bench/benchmarks/groupby.py

+        )
+        self.df_cat_values = df_int.astype({"cat": CAT})
+
+    def time_groupby(self):


likely need to parameterize this over a bunch of functions. assuming this is not too slow to do that; if it is, pls reduce the size of the benchmark to make it reasonable to parameterize.

this takes 10.2 ms ± 28.7 µs per loop on the current commit on my desktop via timeit but 7.14 s ± 31.3 ms per loop on master. I think 10ms should be ok but please let me know otherwise

yes this is fine, but i need you to parameterize over multiple functions, doesn't have to be every one but representative ones (e.g. reductions, transforms and filters)

parameterized over column types in b4648d5

rtlee9 · 2020-04-25T23:22:44Z

latest asv results for new benchmark replicating #32976:

$ asv continuous HEAD -b 'groupby.CategoricalFrame'

before           after         ratio    
     [77a0f19c]       [ec70c57e]
     <master>         <cat_groupby_fix>
-      1.68±0.01s       4.51±0.1ms     0.00  groupby.CategoricalFrame.time_groupby
-      1.68±0.01s      4.43±0.09ms     0.00  groupby.CategoricalFrame.time_groupby_ordered
                                                                 
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.          
PERFORMANCE INCREASED.
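
As a quick check of the "> 99%" claim, the arithmetic behind the first row can be worked out directly, reading the ratio column as after divided by before:

```python
before_s = 1.68      # master: 1.68 s
after_s = 4.51e-3    # this branch: 4.51 ms

ratio = after_s / before_s
improvement_pct = (1 - ratio) * 100

# The ratio rounds to 0.00 at two decimals, matching the table.
print(f"ratio = {ratio:.4f}, improvement = {improvement_pct:.2f}%")
# prints: ratio = 0.0027, improvement = 99.73%
```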

jreback · 2020-04-26T19:59:11Z

asv_bench/benchmarks/groupby.py

+        )
+        self.df_cat_values = df_int.astype({"cat": CAT})
+
+    def time_groupby(self):


yes this is fine, but i need you to parameterize over multiple functions, doesn't have to be every one but representative ones (e.g. reductions, transforms and filters)

jreback · 2020-04-26T19:59:35Z

pandas/core/groupby/ops.py

@@ -472,6 +473,29 @@ def _cython_operation(

        is_datetimelike = needs_i8_conversion(values.dtype)
        is_numeric = is_numeric_dtype(values.dtype)
+        is_categorical = is_categorical_dtype(values)
+        cat_method_blacklist = (
+            "add",


i don't think you need this at all, what breaks if you take it out entirely (the blacklist)

pep8speaks · 2020-04-26T22:10:25Z

Hello @rtlee9! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-04 01:53:13 UTC

jreback · 2020-04-26T22:23:46Z

asv_bench/benchmarks/groupby.py

@@ -510,6 +512,33 @@ def time_groupby_extra_cat_nosort(self):
        self.df_extra_cat.groupby("a", sort=False)["b"].count()


+class CategoricalFrame:
+    # benchmark grouping with operations on categorical values (GH #32976)
+    param_names = ["groupby_type", "value_type"]


parameterize over

mean and head as well

i would also reduce the number of groups as well
it will still show an appreciable diff

done. but I swapped mean for count since mean requires numeric types

i also removed the str groupby parameter option to limit the number of tests but the benchmark results are the same either way for this diff.

here are the latest results:

       before           after         ratio
     [22cf0f5d]       [a16f6a23]
     <master>
-         167±1ms      2.72±0.09ms     0.02  groupby.CategoricalFrame.time_groupby(<class 'int'>, <class 'str'>, 'last')
-         168±2ms      2.66±0.07ms     0.02  groupby.CategoricalFrame.time_groupby_ordered(<class 'int'>, <class 'str'>, 'last')
-         174±2ms      2.65±0.09ms     0.02  groupby.CategoricalFrame.time_groupby_ordered(<class 'int'>, <class 'int'>, 'last')
-         175±2ms      2.61±0.05ms     0.01  groupby.CategoricalFrame.time_groupby(<class 'int'>, <class 'int'>, 'last')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
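
For readers unfamiliar with ASV parameterization, here is a rough sketch of a benchmark parameterized over value type and method. It is not the benchmark from this PR: the class name, sizes, and method list are assumptions, and the loop at the end only imitates how asv would call `setup` and `time_groupby` once per parameter combination.

```python
import itertools

import numpy as np
import pandas as pd


class CategoricalFrameSketch:
    # Hypothetical ASV-style benchmark; asv reads `params` and `param_names`.
    params = [[int, str], ["first", "last", "count"]]
    param_names = ["value_type", "method"]

    def setup(self, value_type, method):
        n_rows, n_groups, n_values = 10_000, 50, 20  # sizes are assumptions
        rng = np.random.default_rng(0)
        values = rng.integers(0, n_values, size=n_rows)
        self.df = pd.DataFrame(
            {
                "group": rng.integers(0, n_groups, size=n_rows),
                "cat": pd.Series(values).astype(value_type).astype("category"),
            }
        )

    def time_groupby(self, value_type, method):
        # Aggregate the categorical column with the requested method.
        getattr(self.df.groupby("group")["cat"], method)()


# Imitate asv: one setup/time call per parameter combination.
bench = CategoricalFrameSketch()
runs = 0
for value_type, method in itertools.product(*CategoricalFrameSketch.params):
    bench.setup(value_type, method)
    bench.time_groupby(value_type, method)
    runs += 1
```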

jreback · 2020-04-28T19:42:36Z

pandas/core/groupby/ops.py

@@ -472,6 +473,29 @@ def _cython_operation(

        is_datetimelike = needs_i8_conversion(values.dtype)
        is_numeric = is_numeric_dtype(values.dtype)
+        is_categorical = is_categorical_dtype(values)
+        cat_method_blacklist = (
+            "add",


yeah i think its ok to simply update this test

TomAugspurger · 2020-05-08T21:00:34Z

@rtlee9 Can you merge master and fix the conflict?

rtlee9 · 2020-05-09T19:02:37Z

Fixed conflict and merged in 7570425

jreback · 2020-05-09T19:15:32Z

doc/source/whatsnew/v1.1.0.rst

@@ -239,7 +239,7 @@ Other API changes
 - ``loc`` lookups with an object-dtype :class:`Index` and an integer key will now raise ``KeyError`` instead of ``TypeError`` when key is missing (:issue:`31905`)
 - Using a :func:`pandas.api.indexers.BaseIndexer` with ``count``, ``min``, ``max``, ``median``, ``skew``,  ``cov``, ``corr`` will now return correct results for any monotonic :func:`pandas.api.indexers.BaseIndexer` descendant (:issue:`32865`)
 - Added a :func:`pandas.api.indexers.FixedForwardWindowIndexer` class to support forward-looking windows during ``rolling`` operations.
-
+- :meth:`DataFrame.groupby` aggregations of categorical series will now return a :class:`Categorical` while preserving the codes and categories of the original series


is this some other issue?

This is a (desirable) side effect of this PR - please find more context in this thread.

jreback · 2020-05-09T19:16:27Z

pandas/core/groupby/ops.py

@@ -494,6 +519,17 @@ def _cython_operation(
                values = ensure_int_or_float(values)
        elif is_numeric and not is_complex_dtype(values):
            values = ensure_float64(values)
+        elif is_categorical:
+            if how in self._cat_method_blacklist:


I really don't like doing this. Can you elaborate when we can actually process this? listing methods is a bad idea generally.

I'm not exactly sure what you mean by "when we can actually process this", but I agree that listing methods isn't necessarily thorough and isn't robust. However, I've been unable to find a suitable alternative to blacklisting methods where we don't want to apply the aggregation on the category codes. Open to other ideas though. Please find more context in this thread.
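
One possible direction, sketched purely as an illustration and not taken from this PR: instead of listing excluded methods, classify methods by whether they only select or compare existing elements, in the spirit of the "ordering-based" restriction mentioned earlier in the conversation. The set names, their membership, and `can_aggregate_codes` are all hypothetical.

```python
# Hypothetical method sets; names and membership are assumptions for illustration.
POSITION_BASED = frozenset({"first", "last"})  # pick an existing element
ORDER_BASED = frozenset({"min", "max"})        # compare elements


def can_aggregate_codes(how: str, ordered: bool) -> bool:
    """Return True if `how` could safely run on category codes."""
    if how in POSITION_BASED:
        return True
    if how in ORDER_BASED:
        # Codes follow category order, which only carries meaning when ordered.
        return ordered
    # Arithmetic reductions (mean, add, prod, ...) need the real values.
    return False
```

Under this sketch, anything not explicitly known to be safe falls back to the slower path, so a newly added reduction would not silently operate on codes.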

jreback · 2020-05-25T22:30:54Z

@rtlee9 if you'd address the comments and merge master

jreback

if u can merge master

i don't really like the specific method blacklist - if u can find a better way

simonjayhawkins · 2020-08-01T14:03:52Z

@rtlee9 can you move release note to 1.2 and merge upstream/master to resolve conflict

simonjayhawkins · 2020-08-03T13:09:48Z

doc/source/whatsnew/v1.2.0.rst

@@ -132,6 +133,7 @@ Plotting
 Groupby/resample/rolling
 ^^^^^^^^^^^^^^^^^^^^^^^^

+- :meth:`DataFrame.groupby` aggregations of categorical series will now return a :class:`Categorical` while preserving the codes and categories of the original series


can you add this PR as the issue (or relevant issue if discussed elsewhere).

Added the PR as the issue in commit 0fd3dde

Aggregate categorical codes with fast cython aggregation for select `how` operations.

8/1/20: rebase and move release note to 1.2
8/2/20: Update tests to expect categorical back
8/3/20: add PR as issue for whatsnew groupby api change

dsaxton · 2020-09-16T18:31:02Z

@rtlee9 Is this still active? Some merge conflicts that need resolving.

jreback · 2020-09-17T02:35:41Z

pandas/core/groupby/ops.py

@@ -356,6 +357,29 @@ def get_group_levels(self) -> List[Index]:

    _name_functions = {"ohlc": ["open", "high", "low", "close"]}

+    _cat_method_blacklist = (
+        "add",


what methods don't work?

arw2019 · 2020-11-06T01:53:02Z

Closing for now. @rtlee9 ping us whenever you'd like to continue and we'll reopen!

rtlee9 force-pushed the cat_groupby_fix branch from 41ae919 to 165cfbb April 23, 2020 15:52

jbrockmendel reviewed Apr 23, 2020

TomAugspurger reviewed Apr 23, 2020

jreback requested changes Apr 23, 2020

jreback added the Categorical (Categorical Data Type) and Performance (Memory or execution speed performance) labels Apr 23, 2020

rtlee9 force-pushed the cat_groupby_fix branch 2 times, most recently from 696508a to 0340554 April 25, 2020 21:06

jreback requested changes Apr 25, 2020

jreback reviewed Apr 25, 2020

rtlee9 force-pushed the cat_groupby_fix branch from 0340554 to ec70c57 April 25, 2020 23:22

jreback requested changes Apr 26, 2020

rtlee9 force-pushed the cat_groupby_fix branch 2 times, most recently from 4398706 to b4648d5 April 26, 2020 22:10

rtlee9 force-pushed the cat_groupby_fix branch from b4648d5 to 673573b April 26, 2020 22:12

jreback reviewed Apr 26, 2020

rtlee9 force-pushed the cat_groupby_fix branch 4 times, most recently from 6498d6b to 9377e8d April 28, 2020 14:55

jreback requested changes Apr 28, 2020

rtlee9 force-pushed the cat_groupby_fix branch from 9377e8d to 79f0c72 April 30, 2020 04:15

rtlee9 force-pushed the cat_groupby_fix branch from 79f0c72 to 7570425 May 9, 2020 18:34

jreback requested changes May 9, 2020

rtlee9 force-pushed the cat_groupby_fix branch 2 times, most recently from acc330a to 46e1e8e May 29, 2020 04:51

jreback requested changes Jul 17, 2020

rtlee9 force-pushed the cat_groupby_fix branch 3 times, most recently from 964ac84 to 725c6d2 August 2, 2020 21:25

simonjayhawkins reviewed Aug 3, 2020

rtlee9 force-pushed the cat_groupby_fix branch from 725c6d2 to 0fd3dde August 4, 2020 01:53

dsaxton added the Stale label Sep 16, 2020

dsaxton mentioned this pull request Sep 16, 2020 in CI: Add stale PR action #36336 (merged)

jreback requested changes Sep 17, 2020

arw2019 closed this Nov 6, 2020
