
PERF: cache _get_cython_function in groupby ops #40178

Conversation

jorisvandenbossche (Member):

Looking up the exact cython function is relatively costly, and it is also something that is done repeatedly for a limited number of possible cases, so it is a good target for caching.

Using the GroupManyLabels benchmark we have:

import numpy as np
from pandas import DataFrame

class GroupManyLabels:
    params = [1, 1000]
    param_names = ["ncols"]

    def setup(self, ncols):
        N = 1000
        data = np.random.randn(N, ncols)
        self.labels = np.random.randint(0, 100, size=N)
        self.df = DataFrame(data)

    def time_sum(self, ncols):
        self.df.groupby(self.labels).sum()

import numpy as np
import pandas as pd

# same data as the benchmark, but with ncols=1000 and the frame converted to the 'array' manager
ncols = 1000
N = 1000
data = np.random.randn(N, ncols)
labels = np.random.randint(0, 100, size=N)
df = pd.DataFrame(data)
df_am = df._as_manager('array')

I get

In [2]: %timeit df_am.groupby(labels).sum()
72.6 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) <-- master
66.4 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) <-- PR

(the difference is not huge, but I am going to do several PRs with small changes that each give a few ms of improvement. With all of those together I hope to get the total time down to around 20 ms, and at that point 5 ms is significant)

jorisvandenbossche added the Groupby and Performance labels on Mar 2, 2021


@functools.lru_cache(maxsize=None)
def _get_cython_function(kind: str, how: str, dtype: np.dtype, is_numeric: bool):

jorisvandenbossche (Member, Author):

This function is a straight cut-and-paste of the existing method, but moved to a standalone function instead of a method so that lru_cache can be used on it.
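
(Sketch of the pattern, with assumed names rather than the actual diff: the lookup moves to module level so that lru_cache keys only on the hashable arguments, and any method kept on the grouper class becomes a thin wrapper.)

import functools
import numpy as np

@functools.lru_cache(maxsize=None)
def _get_cython_function(kind: str, how: str, dtype: np.dtype, is_numeric: bool):
    # stand-in body for illustration; the point is that np.dtype is hashable,
    # so the (kind, how, dtype, is_numeric) tuple works directly as a cache key
    return f"group_{how}_{dtype.name}"  # e.g. "group_add_float64"

class IllustrativeGrouper:
    # hypothetical wrapper class, only to show that callers can stay unchanged
    def _get_cython_function(self, kind, how, dtype, is_numeric):
        return _get_cython_function(kind, how, dtype, is_numeric)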

jreback (Contributor) left a comment:

lgtm.

jreback added this to the 1.3 milestone on Mar 2, 2021
jbrockmendel (Member):

LGTM

# only valid for numeric
f = getattr(libgroupby, ftype, None)
if f is not None and is_numeric:
    return f
Member (reviewer):

Any idea how many cases make it past here? Might be we can make some of this unnecessary on the cython side.

jorisvandenbossche (Member, Author):

Quite a few, I think. E.g. for sum we have group_add_.., which only has the type-specific functions exposed (the same for mean, prod, var). I'm not directly sure why we expose some functions with the numeric fused type while for others we expose the type-specific ones (those are also coded with fused types). That's a bigger change for another PR, though ;)

(That said, an expensive part of this function is dtype.name, which sits above this line anyway, so it won't change that caching this is useful, I think.)
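
(A quick illustrative check, not from the PR, that dtype.name is not free; timings are machine-dependent:)

import numpy as np
from timeit import timeit

dt = np.dtype("float64")
# dtype.name is computed on each access, so repeated lookups add up when the
# lookup runs many times; with the lru_cache it is paid once per cached combination
print(timeit(lambda: dt.name, number=100_000))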

jorisvandenbossche merged commit 38b5e4a into pandas-dev:master on Mar 3, 2021
jorisvandenbossche deleted the perf-groupby-ops-cache-cython-func branch on Mar 3, 2021 at 08:38