ENH: add masked algorithm for mean() #34754

Closed
jorisvandenbossche opened this issue Jun 13, 2020 · 6 comments · Fixed by #34814
Labels: Enhancement; NA - MaskedArrays (related to pd.NA and nullable extension arrays); Numeric Operations (arithmetic, comparison, and logical operations); Reduction Operations (sum, mean, min, max, etc.)

@jorisvandenbossche
Member

Similar to the masked implementations we now have for sum, prod, min and max on the nullable integer array (first PR #30982; the code now lives at https://github.com/pandas-dev/pandas/blob/master/pandas/core/array_algos/masked_reductions.py), we can add one for the mean reduction as well.

A very rough check shows a nice speed-up:

In [27]: arr = pd.array(np.random.randint(0, 1000, 1_000_000), dtype="Int64") 

In [28]: arr[np.random.randint(0, 1_000_000, 1000)] = pd.NA 

In [30]: arr._reduce("mean") 
Out[30]: 499.27095868772903

In [31]: %timeit arr._reduce("mean") 
7.26 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [32]: arr._data.sum(where=~arr._mask, dtype="float64") / (~arr._mask).sum() 
Out[32]: 499.27095868772903

In [33]: %timeit arr._data.sum(where=~arr._mask, dtype="float64") / (~arr._mask).sum()  
2.08 ms ± 6.89 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The nanmean version lives here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/nanops.py#L517
For reference, numpy is also adding a version that accepts a mask: numpy/numpy#15852 (which could be used in the future, and can serve as inspiration for the implementation now).
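
The one-liner timed above can be sketched as a standalone function. This is only a minimal illustration under stated assumptions, not the implementation that was eventually merged: the name `masked_mean`, its `skipna` parameter, and the use of `np.nan` as a stand-in for `pd.NA` are all hypothetical choices made to keep the example numpy-only.

```python
import numpy as np


def masked_mean(values, mask, skipna=True):
    """Hypothetical helper: mean of `values`, ignoring entries where `mask` is True.

    `mask` follows the convention used in the issue: True marks a missing value.
    """
    valid = ~mask
    count = valid.sum()
    # No valid entries, or missing values present while skipna=False:
    # return NaN here as a stand-in for pd.NA (an assumption of this sketch).
    if count == 0 or (not skipna and mask.any()):
        return np.nan
    # Sum only the valid entries as float64, then divide by their count.
    return values.sum(where=valid, dtype="float64") / count


values = np.array([1, 2, 3, 4], dtype="int64")
mask = np.array([False, True, False, False])
result = masked_mean(values, mask)  # mean of 1, 3 and 4
```

Summing with `where=` avoids casting the whole array to float and filling in NaN first, which is presumably where the speed-up in the rough check comes from.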

@jorisvandenbossche added the Enhancement, Numeric Operations, and NA - MaskedArrays labels on Jun 13, 2020
@jorisvandenbossche added this to the Contributions Welcome milestone on Jun 13, 2020
@Akshatt
Contributor

Akshatt commented Jun 14, 2020

Hey @jorisvandenbossche, can I work on this issue? I'm new to open source.

@jorisvandenbossche
Member Author

Sure!

@Akshatt
Contributor

Akshatt commented Jun 15, 2020

take

@Akshatt
Contributor

Akshatt commented Jun 15, 2020

Hey @jorisvandenbossche, please correct me if I'm wrong anywhere!

I've gone through the numpy version of masked mean and implemented a mean function in the masked_reductions.py file.

How do I test the time of this function relative to the older one?

@jorisvandenbossche
Member Author

@Akshatt the easiest is to open a PR now with what you've got (you can mark the PR as "draft" and, e.g., put WIP in the title); that makes it easier to give feedback.

> How do I test the time of this function relative to the older one?

You can do it the same way I showed in the top post of this issue. I am using the %timeit magic in IPython.
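
Outside IPython, a similar comparison can be run with the standard-library `timeit` module. The sketch below is an assumption-laden stand-in rather than the actual pandas code paths: `mean_via_nan` only roughly mimics the older approach (cast to float, fill with NaN, call `np.nanmean`), `mean_via_mask` is the masked sum from the top post, and both function names are hypothetical.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 1000, 100_000)
mask = np.zeros(data.shape, dtype=bool)
mask[rng.integers(0, data.size, 100)] = True  # True marks a missing value


def mean_via_nan():
    # Rough stand-in for the older path: cast to float, fill NaN, use nanmean.
    as_float = data.astype("float64")
    as_float[mask] = np.nan
    return np.nanmean(as_float)


def mean_via_mask():
    # Masked approach from the top post: sum only the valid entries.
    valid = ~mask
    return data.sum(where=valid, dtype="float64") / valid.sum()


for func in (mean_via_nan, mean_via_mask):
    # Take the best of several repeats to reduce noise from other processes.
    best = min(timeit.repeat(func, number=20, repeat=5)) / 20
    print(f"{func.__name__}: {best * 1e3:.3f} ms per call")
```

Absolute timings depend on the machine and the numpy version, so only the relative difference between the two functions is meaningful.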

@Akshatt
Contributor

Akshatt commented Jun 16, 2020

@jorisvandenbossche Okay, got it! I've created a draft pull request, #34814.
Also, I don't understand how to access the mean function that I've created.
Could you shed some light on how to do that?
