
PERF: cache _get_cython_function in groupby ops #40178

Conversation

jorisvandenbossche (Member):

Looking up the exact cython function is relatively costly, and it is also something that is done repeatedly for a limited number of possible cases, so it is a good target for caching.

Using the GroupManyLabels benchmark we have:

import numpy as np
from pandas import DataFrame

class GroupManyLabels:
    params = [1, 1000]
    param_names = ["ncols"]

    def setup(self, ncols):
        N = 1000
        data = np.random.randn(N, ncols)
        self.labels = np.random.randint(0, 100, size=N)
        self.df = DataFrame(data)

    def time_sum(self, ncols):
        self.df.groupby(self.labels).sum()

import numpy as np
import pandas as pd

# same data as the benchmark, but with ncols=1000 and the frame converted to the 'array' manager
ncols = 1000
N = 1000
data = np.random.randn(N, ncols)
labels = np.random.randint(0, 100, size=N)
df = pd.DataFrame(data)
df_am = df._as_manager('array')

I get

In [2]: %timeit df_am.groupby(labels).sum()
72.6 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) <-- master
66.4 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) <-- PR

(the difference is not huge, but I am going to do several PRs with small changes that each give a few ms of improvement. With all of those together I hope to get the total time down to around 20 ms, and at that point 5 ms is significant)

jorisvandenbossche added the Groupby and Performance labels on Mar 2, 2021


@functools.lru_cache(maxsize=None)
def _get_cython_function(kind: str, how: str, dtype: np.dtype, is_numeric: bool):

jorisvandenbossche (Member, Author):

This function is a straight cut-and-paste of the existing method, but moved to a standalone function instead of a method so that lru_cache can be used on it.
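
(Sketch of the pattern, with assumed names rather than the actual diff: the lookup moves to module level so that lru_cache keys only on the hashable arguments, and any method kept on the grouper class becomes a thin wrapper.)

import functools
import numpy as np

@functools.lru_cache(maxsize=None)
def _get_cython_function(kind: str, how: str, dtype: np.dtype, is_numeric: bool):
    # stand-in body for illustration; the point is that np.dtype is hashable,
    # so the (kind, how, dtype, is_numeric) tuple works directly as a cache key
    return f"group_{how}_{dtype.name}"  # e.g. "group_add_float64"

class IllustrativeGrouper:
    # hypothetical wrapper class, only to show that callers can stay unchanged
    def _get_cython_function(self, kind, how, dtype, is_numeric):
        return _get_cython_function(kind, how, dtype, is_numeric)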

jreback (Contributor) left a comment:

lgtm.

jreback added this to the 1.3 milestone on Mar 2, 2021
jbrockmendel (Member):

LGTM

# only valid for numeric
f = getattr(libgroupby, ftype, None)
if f is not None and is_numeric:
    return f
Member (reviewer):

Any idea how many cases make it past here? Might be we can make some of this unnecessary on the cython side.

jorisvandenbossche (Member, Author):

Quite a few, I think. E.g. for sum we have group_add_.., which only has the type-specific functions exposed (the same for mean, prod, var). I'm not directly sure why we expose some functions with the numeric fused type while for others we expose the type-specific ones (those are also coded with fused types). That's a bigger change for another PR, though ;)

(That said, an expensive part of this function is dtype.name, which sits above this line anyway, so it won't change that caching this is useful, I think.)
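
(A quick illustrative check, not from the PR, that dtype.name is not free; timings are machine-dependent:)

import numpy as np
from timeit import timeit

dt = np.dtype("float64")
# dtype.name is computed on each access, so repeated lookups add up when the
# lookup runs many times; with the lru_cache it is paid once per cached combination
print(timeit(lambda: dt.name, number=100_000))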

jorisvandenbossche merged commit 38b5e4a into pandas-dev:master on Mar 3, 2021
jorisvandenbossche deleted the perf-groupby-ops-cache-cython-func branch on Mar 3, 2021 at 08:38