Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: support parallel calculation of nancorr #24795

Closed
wants to merge 1 commit into from

Conversation

noamher
Copy link
Contributor

@noamher noamher commented Jan 16, 2019

This is a proposal for using openmp to speedup the nancorr function (used by pd.DataFrame.corr).

If this is something that is useful, it can probably be implemented for other cython algorithms implemented in algos.pyx.

Also, the interface has to be decided on: how to choose whether to use parallelization or not, how many cpus, schedule strategy for the prange, etc.

I am not sure what the implications are for adding openmp to compilation and linkage in terms of portability.

Using 4 cpus I got ~60% speedup on a pd.DataFrame.corr (of size 20000 x 1300).

@pep8speaks
Copy link

pep8speaks commented Jan 16, 2019

Hello @noamher! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 16, 2019 at 08:57 Hours UTC

@noamher noamher changed the title Support parallel calculation of nancorr PERF: Support parallel calculation of nancorr Jan 16, 2019
@noamher noamher changed the title PERF: Support parallel calculation of nancorr PERF: support parallel calculation of nancorr Jan 16, 2019
@gfyoung gfyoung added Build Library building on various platforms Performance Memory or execution speed performance Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Needs Discussion Requires discussion from core team before further action labels Jan 16, 2019
@gfyoung
Copy link
Member

gfyoung commented Jan 16, 2019

@noamher : Thanks for opening this PR! While the performance increases sounds appealing from the onset, a couple of points that could improve this proposal:

  • Given the degree of personal uncertainty in your own implementation, we generally prefer that you open an issue PRIOR TO the PR. The issue is a better place to iron out those uncertainties before making a concrete proposal to change the code.

  • You should run the ASV benchmarks (under the asv_bench directory) to provide more concrete evidence that this is indeed a performance speedup. If the tests don't indicate any performance speedup, trying adding your new benchmark to provide that evidence.

@TomAugspurger
Copy link
Contributor

I don't think we're quite ready for this. We need to do some more discussion on when and how we'll use parallelism.


from libc.stdlib cimport malloc, free
from libc.string cimport memmove
from libc.math cimport fabs, sqrt
from cpython cimport bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does cython treat this differently from built-in bool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the type used by cython to declare a python boolean (an object which means many operations can only be taken on it when gil is taken).

int64_t nobs = 0
float64_t vx, vy, sumx, sumy, sumxx, sumyy, meanx, meany, divisor
int64_t blah = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary, will be removed.

nancorr_single_row(mat, N, K, result, xi, mask, minpv, cov)
else:
with nogil:
for xi in range(K):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be collapsed with something like range_func = range(K) if not parallel else prange(K, schedule='dynamic')?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK No, because prange is not really a function, it is converted into a #pragma of sorts in the c code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I think there is an issue on Cython's GH about making a maybe_nogil that would be semantically similar to maybe_prange. Might be worth commenting there

@@ -677,10 +678,11 @@ def srcpath(name=None, suffix='.pyx', subdir='src'):
obj = Extension('pandas.{name}'.format(name=name),
sources=sources,
depends=data.get('depends', []),
include_dirs=include,
include_dirs=include + [numpy.get_include()],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this different from the pkg_resources version that is already in there?

language=data.get('language', 'c'),
define_macros=data.get('macros', macros),
extra_compile_args=extra_compile_args)
extra_compile_args=['-fopenmp'] + extra_compile_args,
extra_link_args=['-fopenmp'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this platform-dependent in any way? i.e. do we need to check that it is available?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the gcc flag to link against openmp and I think clang uses the same flag but I'm not sure.

@@ -230,14 +232,15 @@ def kth_smallest(numeric[:] a, Py_ssize_t k) -> numeric:

@cython.boundscheck(False)
@cython.wraparound(False)
def nancorr(ndarray[float64_t, ndim=2] mat, bint cov=0, minp=None):
def nancorr(float64_t[:, :] mat, bint cov=0, minp=None, bool parallel=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on changing to memoryviews where viable. if we don't alter mat inplace, might want to add the const modifier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

bint minpv,
bint cov=0) nogil:
for yi in range(xi + 1):
nancorr_single(mat, N, K, result, xi, yi, mask, minpv, cov)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if nancorr_single isn't used elsewhere, I'd inline it here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jreback
Copy link
Contributor

jreback commented Jan 16, 2019

I agree, this is out of scope for now w/o further API discussion. We don't have a generic parallelism story ATM. this would have to be threaded down to the function and have the abilitty to generically turn parallelism on/off at a much higher level for other frameworks.

@noamher if you would open an issue and try to reference a couple of open issues (pls do a search). for discussions.

@jreback jreback closed this Jan 16, 2019
@jreback
Copy link
Contributor

jreback commented Jan 16, 2019

@noamher the memory view changes are ok, would take a PR on those.

@noamher
Copy link
Contributor Author

noamher commented Jan 16, 2019

@jreback I'll open a relevant issue and see if I can find other relevant issues.

I'll do the memory view changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Build Library building on various platforms Needs Discussion Requires discussion from core team before further action Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants