Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: remove block access in groupby libreduction Series(Bin)Grouper #40199

Conversation

jorisvandenbossche
Copy link
Member

xref #39146

The Series(Bin)Grouper still has some direct block access, which needs to be resolved to let it work for both Block and ArrayManager

@jorisvandenbossche jorisvandenbossche added Groupby Internals Related to non-user accessible pandas implementation labels Mar 3, 2021
Comment on lines 70 to 75
if self.has_block:
object.__setattr__(cached_typ._mgr._block, 'values', vslider.buf)
object.__setattr__(cached_typ._mgr._block, 'mgr_locs',
slice(len(vslider.buf)))
else:
cached_typ._mgr.arrays[0] = vslider.buf
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could probably replace this full if/else block with a single cached_typ._mgr.set_values(vslider.buf) (if I add the setting of mgr_locs to SingleBlockManager.set_values).

But I assume that, currently, we don't use a plain python attribute setting but object.__setattr__ for performance? Using _mgr.set_values(..) might defeat that purpose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the newer commit I pushed does this change, and is thus now using _mgr.set_values.

I did some timings with a specific function that uses the SeriesGrouper with many labels + a cheap dummy function (adapted from the same benchmark case as I have been using for the other groupby PRs: #40178 (comment)):

ncols = 1000
N = 1000
data = np.random.randn(N, ncols)
labels = np.random.randint(0, 100, size=N)
df = pd.DataFrame(data)

%timeit df.groupby(labels)[0].agg(lambda x: 1)

And repeating this several times switching back and forth between this version and master, I don't see any difference. A representative timing was:

In [15]: %timeit df.groupby(labels)[0].agg(lambda x: 1)
1.18 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <-- master
1.17 ms ± 88.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <-- PR

@@ -66,9 +66,7 @@ cdef class _BaseGrouper:
object.__setattr__(cached_ityp, '_index_data', islider.buf)
cached_ityp._engine.clear_mapping()
cached_ityp._cache.clear() # e.g. inferred_freq must go
object.__setattr__(cached_typ._mgr._block, 'values', vslider.buf)
object.__setattr__(cached_typ._mgr._block, 'mgr_locs',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might get a small boost by setting _mgr_locs to BlockPlacement(slice(...)) instead of going through the mgr_locs property

@jorisvandenbossche jorisvandenbossche changed the title [ArrayManager] Fix groupby libreduction Series(Bin)Grouper REF: remove block access in groupby libreduction Series(Bin)Grouper Mar 3, 2021
@jreback jreback added this to the 1.3 milestone Mar 4, 2021
@jreback
Copy link
Contributor

jreback commented Mar 4, 2021

typing issue on ci / checks

@jorisvandenbossche
Copy link
Member Author

It's the failure that was failing on master yesterday. Will merge master to be sure.

@jorisvandenbossche jorisvandenbossche merged commit 81114eb into pandas-dev:master Mar 4, 2021
@jorisvandenbossche jorisvandenbossche deleted the am-groupby-libreduction-series branch March 4, 2021 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants