PERF: speed-up when scalar not found in Categorical's categories #29750

Merged


topper-123
Contributor

@topper-123 topper-123 commented Nov 20, 2019

I took a look at the code in core/arrays/categorical.py and found that it uses np.repeat to create boolean arrays. This path is hit when comparing against a scalar that is not among the categories. np.repeat is much slower than np.zeros/np.ones for that purpose.

>>> import pandas as pd
>>> n = 1_000_000
>>> c = pd.Categorical(['a'] * n + ['b'] * n + ['c'] * n)
>>> %timeit c == 'x'
17 ms ± 270 µs per loop  # master
82.6 µs ± 488 ns per loop  # this PR
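
For anyone who wants to check the NumPy side of this in isolation, here is a minimal sketch. It is not the diff from this PR; the helper names (all_false_repeat, all_false_zeros, all_true_ones) are made up for illustration, and the timings will vary by machine and NumPy version.

```python
import timeit

import numpy as np


def all_false_repeat(n):
    """Build an all-False boolean array by repeating a scalar (the slower way)."""
    return np.repeat(False, n)


def all_false_zeros(n):
    """Build an all-False boolean array by allocating zeros directly."""
    return np.zeros(n, dtype=bool)


def all_true_ones(n):
    """Build an all-True boolean array, e.g. for a `!=` comparison result."""
    return np.ones(n, dtype=bool)


if __name__ == "__main__":
    n = 3_000_000  # same length as the Categorical in the timing above
    runs = 10

    # Both constructions must give identical results before timing them.
    assert np.array_equal(all_false_repeat(n), all_false_zeros(n))

    t_repeat = timeit.timeit(lambda: all_false_repeat(n), number=runs)
    t_zeros = timeit.timeit(lambda: all_false_zeros(n), number=runs)

    print(f"np.repeat(False, n):      {t_repeat / runs * 1e3:.3f} ms per call")
    print(f"np.zeros(n, dtype=bool):  {t_zeros / runs * 1e3:.3f} ms per call")
```

Both constructions return the same bool array, so swapping one for the other should not change the result of the comparison, only how long it takes to build.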

@topper-123 topper-123 force-pushed the perf_categorical_not_in_categories branch from fe1926a to a16416b November 20, 2019 18:24
@topper-123 topper-123 changed the title from "PERF: scalar not found in Categorical's categories" to "PERF: speed-up when scalar not found in Categorical's categories" Nov 20, 2019
@jreback jreback added the Categorical (Categorical Data Type) and Performance (Memory or execution speed performance) labels Nov 21, 2019
@jreback jreback added this to the 1.0 milestone Nov 21, 2019
@jreback jreback merged commit 2570c1d into pandas-dev:master Nov 21, 2019
@jreback
Contributor

jreback commented Nov 21, 2019

Thanks @topper-123. Any additional asv benchmarks for this or other Categorical code are surely welcome.

@topper-123 topper-123 deleted the perf_categorical_not_in_categories branch November 21, 2019 13:45