Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: fix #32976 slow group by for categorical columns #33739

Closed
wants to merge 1 commit into from

Conversation

rtlee9
Copy link
Contributor

@rtlee9 rtlee9 commented Apr 23, 2020

Aggregate categorical codes with fast cython aggregation for select how operations. Added new ASV benchmark copied from 32976 indicating > 99% improvement in performance for this case.

result, _ = self.grouper.aggregate(
block.values, how, axis=1, min_count=min_count

cat_method_blacklist = (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this list from asv_bench.benchmarks.groupby.method_blacklist['object'] and appended the "add" method. Is there a better way of blacklisting these methods which shouldn't be applied to categorical codes?

tests.groupby.test_function.test_arg_passthru is an example of a test failure without this blacklisting.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel how close do you think we are to defining an API to allow passing EA values to cython arrays?

# ignore the method names.
values = self._values_for_cython(method="first")
# ... operation on values
result = self._result_from_cython(..., dtype=self.dtype method="first")

For categorical, that method returns codes and result_from_cython is from_codes.

Comment on lines 1095 to 1096
categories=block.values.categories,
ordered=block.values.ordered,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
categories=block.values.categories,
ordered=block.values.ordered,
)
dtype=block.values.dtype
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually tried this and found that it raises a ValueError here: Cannot specify categories or ordered together with dtype

doc/source/whatsnew/v1.1.0.rst Outdated Show resolved Hide resolved
@jbrockmendel
Copy link
Member

how close do you think we are to defining an API to allow passing EA values to cython arrays?
values = self._values_for_cython(method="first")

If we restrict attention to ordering-based cython methods then i think this is pretty straightforward, just need to get everyone on board.

asv_bench/benchmarks/groupby.py Outdated Show resolved Hide resolved
pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
@jreback jreback added Categorical Categorical Data Type Performance Memory or execution speed performance labels Apr 23, 2020
@rtlee9 rtlee9 force-pushed the cat_groupby_fix branch 2 times, most recently from 696508a to 0340554 Compare April 25, 2020 21:06
asv_bench/benchmarks/groupby.py Outdated Show resolved Hide resolved
doc/source/whatsnew/v1.1.0.rst Outdated Show resolved Hide resolved
@@ -472,6 +473,29 @@ def _cython_operation(

is_datetimelike = needs_i8_conversion(values.dtype)
is_numeric = is_numeric_dtype(values.dtype)
is_categorical = is_categorical_dtype(values)
cat_method_blacklist = (
"add",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need a blacklist at all? you are already only operating on the codes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this method is passed categorical values with a "how" like mean then we don't want to average the codes as if they were ints -- this would happen in the pandas.tests.groupby.test_function.test_arg_passthru test without the blacklist, for example. the logic for cases these blacklist cases is already handled in higher in the call stack in pandas.core.groupby.generic.DataFrameGroupBy._cython_agg_blocks but maybe we could move some of that logic here for conciseness and/or perf later on

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think you need this at all, what breaks if you take it out entirely (the blackist)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test will fail without the blacklist: pandas.tests.groupby.test_function.test_arg_passthru. Here's a minimal example that will fail:

import pandas as pd

df = pd.DataFrame(
    {"group": [1, 1, 2], "category_string": pd.Series(list("abc")).astype("category")},
    columns=["group", "category_string"],
)

df.groupby("group").mean(numeric_only=False)

Nothing else fails so perhaps this test can be updated in lieu of the blacklist, not sure if this is fine from an interface perspective.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i think its ok to simply update this test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It really seems like we need this blacklist anytime the cython op does something with the actual values then we can't pass the codes there. For example something like mean or sum on a numeric categorical.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. ok then, let's move this to an attribute on the class, or a function maybe better for easier re-sue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to class level in 79f0c72. We could easily make this accessible outside the class with an additional method w/ property decorator

@@ -466,7 +466,7 @@ def test_agg_cython_category_not_implemented_fallback():
result = df.groupby("col_num").col_cat.first()
expected = pd.Series(
[1, 2, 3], index=pd.Index([1, 2, 3], name="col_num"), name="col_cat"
)
).astype("category")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _cython_operation method in this commit returns a categorical if passed a categorical. In this test the "col_cat" column is categorical so it seems intuitive to me that when aggregated over it should also be a category.

The fact that this test was passing before suggests that this is not also true in the higher order functions like DataFrameGroupBy._cython_agg_blocks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change needs a dedicated whatsnew entry.

Agreed that preserving the dtype is probably the right choice, but this is a relatively large change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, just added a new entry in the "Other API changes" section in commit 6498d6b

@jreback
Copy link
Contributor

jreback commented Apr 25, 2020

pls show the performance before / after.

)
self.df_cat_values = df_int.astype({"cat": CAT})

def time_groupby(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likely need to paramaterize this over a bunch of functions. assume this is not too slow to do that. if so pls reduce the size of the benchmark to make it reasonable to parameterize.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this takes 10.2 ms ± 28.7 µs per loop on the current commit on my desktop via timeit but 7.14 s ± 31.3 ms per loop on master. I think 10ms should be ok but please let me know otherwise

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this is fine, but i need you to paramaterize over multiple functions, doesn't have to be every one but represententative ones (e.g. reductions, transforms and filters)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

paramaterized over column types in b4648d5

@rtlee9
Copy link
Contributor Author

rtlee9 commented Apr 25, 2020

latest asv results for new benchmark replicating #32976:

$ asv continuous HEAD -b 'groupby.CategoricalFrame'

before           after         ratio    
     [77a0f19c]       [ec70c57e]
     <master>         <cat_groupby_fix>
-      1.68±0.01s       4.51±0.1ms     0.00  groupby.CategoricalFrame.time_groupby
-      1.68±0.01s      4.43±0.09ms     0.00  groupby.CategoricalFrame.time_groupby_ordered
                                                                 
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.          
PERFORMANCE INCREASED.

)
self.df_cat_values = df_int.astype({"cat": CAT})

def time_groupby(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this is fine, but i need you to paramaterize over multiple functions, doesn't have to be every one but represententative ones (e.g. reductions, transforms and filters)

@@ -472,6 +473,29 @@ def _cython_operation(

is_datetimelike = needs_i8_conversion(values.dtype)
is_numeric = is_numeric_dtype(values.dtype)
is_categorical = is_categorical_dtype(values)
cat_method_blacklist = (
"add",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think you need this at all, what breaks if you take it out entirely (the blackist)

@rtlee9 rtlee9 force-pushed the cat_groupby_fix branch 2 times, most recently from 4398706 to b4648d5 Compare April 26, 2020 22:10
@pep8speaks
Copy link

pep8speaks commented Apr 26, 2020

Hello @rtlee9! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-04 01:53:13 UTC

@@ -510,6 +512,33 @@ def time_groupby_extra_cat_nosort(self):
self.df_extra_cat.groupby("a", sort=False)["b"].count()


class CategoricalFrame:
# benchmark grouping with operations on categorical values (GH #32976)
param_names = ["groupby_type", "value_type"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oarameterize over

mean and head as well

i would also reduce the number of groups as well
it will still show an appreciable diff

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. but I swapped mean for count since mean requires numeric types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i also removed the str groupby parameter option to limit the number of tests but the benchmark results are the same either way for this diff.

here are the latest results:

       before           after         ratio
     [22cf0f5d]       [a16f6a23]
     <master>                   
-         167±1ms      2.72±0.09ms     0.02  groupby.CategoricalFrame.time_groupby(<class 'int'>, <class 'str'>, 'last')
-         168±2ms      2.66±0.07ms     0.02  groupby.CategoricalFrame.time_groupby_ordered(<class 'int'>, <class 'str'>, 'last')
-         174±2ms      2.65±0.09ms     0.02  groupby.CategoricalFrame.time_groupby_ordered(<class 'int'>, <class 'int'>, 'last')
-         175±2ms      2.61±0.05ms     0.01  groupby.CategoricalFrame.time_groupby(<class 'int'>, <class 'int'>, 'last')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@rtlee9 rtlee9 force-pushed the cat_groupby_fix branch 4 times, most recently from 6498d6b to 9377e8d Compare April 28, 2020 14:55
@@ -472,6 +473,29 @@ def _cython_operation(

is_datetimelike = needs_i8_conversion(values.dtype)
is_numeric = is_numeric_dtype(values.dtype)
is_categorical = is_categorical_dtype(values)
cat_method_blacklist = (
"add",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i think its ok to simply update this test

@TomAugspurger
Copy link
Contributor

@rtlee9 Can you merge master and fix the conflict?

@rtlee9
Copy link
Contributor Author

rtlee9 commented May 9, 2020

Fixed conflict and merged in 7570425

@@ -239,7 +239,7 @@ Other API changes
- ``loc`` lookups with an object-dtype :class:`Index` and an integer key will now raise ``KeyError`` instead of ``TypeError`` when key is missing (:issue:`31905`)
- Using a :func:`pandas.api.indexers.BaseIndexer` with ``count``, ``min``, ``max``, ``median``, ``skew``, ``cov``, ``corr`` will now return correct results for any monotonic :func:`pandas.api.indexers.BaseIndexer` descendant (:issue:`32865`)
- Added a :func:`pandas.api.indexers.FixedForwardWindowIndexer` class to support forward-looking windows during ``rolling`` operations.
-
- :meth:`DataFrame.groupby` aggregations of categorical series will now return a :class:`Categorical` while preserving the codes and categories of the original series
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this some other issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a (desirable) side effect of this PR - please find more context in this thread.

@@ -494,6 +519,17 @@ def _cython_operation(
values = ensure_int_or_float(values)
elif is_numeric and not is_complex_dtype(values):
values = ensure_float64(values)
elif is_categorical:
if how in self._cat_method_blacklist:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like doing this. Can you elaborate when we can actually process this? listing methods is a bad idea generally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not exactly sure what you mean by "when we can actually process this" this but I agree that listing methods isn't necessarily thorough and isn't robust. However, I've been unable to find a suitable alternative to blacklisting methods where we don't want to apply the aggregation on the category codes -- open to other ideas though. Please find more context in this thread.

@jreback
Copy link
Contributor

jreback commented May 25, 2020

@rtlee9 if you'd address the comments and merge master

@rtlee9 rtlee9 force-pushed the cat_groupby_fix branch 2 times, most recently from acc330a to 46e1e8e Compare May 29, 2020 04:51
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if u can merge master

i don't really like the specific method blacklist - if u can find a better way

@simonjayhawkins
Copy link
Member

@rtlee9 can you move release note to 1.2 and merge upstream/master to resolve conflict

@rtlee9 rtlee9 force-pushed the cat_groupby_fix branch 3 times, most recently from 964ac84 to 725c6d2 Compare August 2, 2020 21:25
@@ -132,6 +133,7 @@ Plotting
Groupby/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^

- :meth:`DataFrame.groupby` aggregations of categorical series will now return a :class:`Categorical` while preserving the codes and categories of the original series
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this PR as the issue (or relevant issue if discussed elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the PR as the issue in commit 0fd3dde

Aggregate categorical codes with fast cython aggregation for select
`how` operations.

8/1/20: rebase and move release note to 1.2
8/2/20: Update tests to expect categorical back
8/3/20: add PR as issue for whatsnew groupby api change
@dsaxton
Copy link
Member

dsaxton commented Sep 16, 2020

@rtlee9 Is this still active? Some merge conflicts that need resolving.

@@ -356,6 +357,29 @@ def get_group_levels(self) -> List[Index]:

_name_functions = {"ohlc": ["open", "high", "low", "close"]}

_cat_method_blacklist = (
"add",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what methods don't work?

@arw2019
Copy link
Member

arw2019 commented Nov 6, 2020

Closing for now. @rtlee9 ping us whenever you'd like to continue and we'll reopen!

@arw2019 arw2019 closed this Nov 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Categorical columns are slow in groupby operations
8 participants