Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add signature argument to vectorize for vectorizing like generalized ufuncs #8054

Merged
merged 6 commits into from
Oct 18, 2016

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Sep 15, 2016

I brought this up on the mailing list today and received some support, so I
thought I would share a preliminary patch.

Still needs tests and more extensive documentation.

Example usage (from the docstring):

Vectorized calculation of Pearson correlation coefficient and its p-value:

>>> import scipy.stats
>>> pearsonr = np.vectorize(scipy.stats.pearsonr,
...                         signature='(n),(n)->(),()')
>>> pearsonr([[0, 1, 2, 3]], [[1, 2, 3, 4], [4, 3, 2, 1]])
(array([ 1., -1.]), array([ 0.,  0.]))

Vectorized convolution:

>>> convolve = np.vectorize(np.convolve, signature='(n),(m)->(k)')
>>> convolve(np.eye(4), [1, 2, 1])
array([[ 1.,  2.,  1.,  0.,  0.,  0.],
       [ 0.,  1.,  2.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  2.,  1.,  0.],
       [ 0.,  0.,  0.,  1.,  2.,  1.]])

A note on speed: a vectorize with signature has about 6x the fixed overhead of vectorize (60us vs 10us) and also a much larger overhead per loop (2us vs 100ns). I don't think these numbers will make any difference for the intended use case (only trivial Python functions runs in less than 10us) but I thought I would toss them out there anyways. I have yet to make any attempts at optimization.

# run on python 3.5.2
import numpy as np
import operator

vectorize_add = np.vectorize(operator.add, otypes='i')
guvectorize_add = np.vectorize(operator.add, signature='(),()->()', otypes='i')

x = np.ones(1000000, dtype=int)

%timeit vectorize_add(1, 1)
#100000 loops, best of 3: 10.6 µs per loop

%timeit guvectorize_add(1, 1)
#10000 loops, best of 3: 63.4 µs per loop

%timeit vectorize_add(x, x)
#10 loops, best of 3: 112 ms per loop

%timeit guvectorize_add(x, x)
#1 loop, best of 3: 2.22 s per loop

Also fixes #5868

@@ -1,3 +1,5 @@
.. _c-api.generalized-ufuncs:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to add a reference to this doc page from the docstring for guvectorize. Do we have a standard way to do that? I guess a URL under "References" is probably most robust, though I was thinking a sphinx reference might work...

@shoyer
Copy link
Member Author

shoyer commented Sep 26, 2016

I've added tests, so this is ready for review.

Copy link
Contributor

@juliantaylor juliantaylor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a nice feature

do we need a new api name guvectorize? could this be put into vectorize with an optional signature argument?

core_dims : Tuple[str, ...]
Core dimensions for this argument.
"""
if core_dims:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not core_dims: return instead of an extra indent level for the whole function might be nicer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, done

self.pyfunc = pyfunc
self.signature = signature

if not re.match(_SIGNATURE, signature):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part is just _parse_gufunc_signature?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, indeed, good catch

_SIGNATURE = '^%s->%s$' % (_ARGUMENT_LIST, _ARGUMENT_LIST)


def _parse_gufunc_signature(signature):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't these functions just be private methods of the guvectorize class?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer writing functions for isolated functionality. Among other virtues, it makes things easier to test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could use @classmethod, can be tested in isolation of creating the object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, could use @classmethod or @staticmethod here, but my feeling is that it's poor style to use a class merely as a namespace. (It's actually against our internal style guide.)

@shoyer
Copy link
Member Author

shoyer commented Sep 30, 2016

do we need a new api name guvectorize? could this be put into vectorize with an optional signature argument?

@juliantaylor This is a good point. A signature argument would indeed be a cleaner way to do this. I will give that a try.

Example usage (from the docstring):

    Vectorized convolution:

    >>> convolve = np.vectorize(np.convolve, signature='(n),(m)->(k)')
    >>> convolve(np.eye(4), [1, 2, 1])
    array([[ 1.,  2.,  1.,  0.,  0.,  0.],
           [ 0.,  1.,  2.,  1.,  0.,  0.],
           [ 0.,  0.,  1.,  2.,  1.,  0.],
           [ 0.,  0.,  0.,  1.,  2.,  1.]])
@shoyer shoyer changed the title ENH: add guvectorize for vectorizing like generalized ufuncs ENH: add signature argument to vectorize for vectorizing like generalized ufuncs Sep 30, 2016
@shoyer
Copy link
Member Author

shoyer commented Sep 30, 2016

This is ready for further review.

I rewrote this to simply add a signature argument to vectorize instead of writing a whole new function. It is indeed much cleaner this way.

Tuple of input and output core dimensions parsed from the signature, each
of the form List[Tuple[str, ...]].
"""
if not re.match(_SIGNATURE, signature):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test carry much overhead? My always-try-to-optimize the normal path temptation would be to put the return statement inside a try/except clause like

try:
    return ...
except Exception:
    if not re.match...
        raise ValueError(...)
    else:
        raise

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the parsing is not really performance relevant, it is done on the object creation, the time consuming part is executing it on the data
but if you want to improve it, precompiling the regex might be worthwhile

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @juliantaylor that this is unlikely to be a performance bottleneck, given that it is done on construction. My timing shows it takes around 10us, which is really in micro-benchmark territory.

My understanding is that in most cases it isn't necessary to worry about precompiling a regex because the re module maintains its own cache: https://docs.python.org/2/library/re.html#re.compile

@@ -2242,17 +2243,116 @@ def disp(mesg, device=None, linefeed=True):
return


# See http://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html
_DIMENSION_NAME = r'\w+'
_CORE_DIMENSION_LIST = '(?:%s(?:,%s)*)?' % (_DIMENSION_NAME, _DIMENSION_NAME)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below: being relative new to python, I got used to new-style formats immediately, and always found the old-style ones a bit clunky. Would it be an idea to try to use new-style?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's a little better. Neither is especially readable for building regular expressions, to be honest.

@mhvk
Copy link
Contributor

mhvk commented Oct 3, 2016

This looks really nice. Only two trivial comments, which you can feel free to ignore.

@shoyer shoyer added this to the 1.12.0 release milestone Oct 3, 2016
@shoyer
Copy link
Member Author

shoyer commented Oct 7, 2016

Note to self: this would be a good time to fix #5868, by adding a sensible error message if otypes is not provided and the output is empty.

@shoyer
Copy link
Member Author

shoyer commented Oct 7, 2016

CC @mforbes (the last person to significantly update vectorize)

r = f([0, 3, 6, 9], [1, 3, 5, 7])
assert_array_equal(r, [1, 6, 1, 2])

def test_siganture_mean_last(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

siganture is a mispelling

@shoyer
Copy link
Member Author

shoyer commented Oct 11, 2016

I added the fix for size 0 inputs, in case anyone wants to take another look.

@shoyer
Copy link
Member Author

shoyer commented Oct 17, 2016

I'd love to merge this in time for v1.12.0.

@mhvk @juliantaylor would one of you mind taking another look at my revisions? Thanks.

% (len(input_core_dims), len(args)))
args = tuple(asanyarray(arg) for arg in args)

# consider checking for size 0 inputs?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment still necessary? You do seem to check for size 0 things at the end.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, this is outdated. Removed.

@mhvk
Copy link
Contributor

mhvk commented Oct 17, 2016

@shoyer - yes, indeed, this should be in 1.2, it is a very nice addition! I looked over the code fairly carefully again, and only had a a comment about a comment...

@charris
Copy link
Member

charris commented Oct 17, 2016

@shoyer Sounds like time to cleanup the commit history.

@shoyer
Copy link
Member Author

shoyer commented Oct 17, 2016

@charris how do you feel about GitHub's "Squash and merge" option? I think that can result in something that looks very similar to what I would make by hand here.

@charris
Copy link
Member

charris commented Oct 17, 2016

@shoyer Don't like it, it squashes, then fast forwards the commit to master, so there is no record of the merge or the PR #. Some folks dislike the "bubble" git history resulting from lots of small merges, but for us I think it is a plus as both the authors and committers are documented along with the PR #, which makes it easy to automate part of the release documentation and link to the relevant PR. Maybe we could fix up the github function at some point if there is a hook. Don't remember what happens with the documentation.

@shoyer
Copy link
Member Author

shoyer commented Oct 17, 2016

@charris OK, no problem I will rebase.

I disabled "Allow squash merging" and "Allow rebase merging" in GitHub settings so those buttons no longer tempt me :).

I also disabled force pushing to master while I was at it, since I'm pretty sure we don't want to allow that and this avoids accidental force pushes (which can happen).

@charris
Copy link
Member

charris commented Oct 17, 2016

That said, I only ran the experiment once, so I might have missed some options along the way. Might be worth creating a test repo and experimenting with the buttons.

@shoyer
Copy link
Member Author

shoyer commented Oct 17, 2016

"Squash and merge" puts everything in one commit, which by default has the PR number in the title, e.g., for this PR:

ENH: add signature argument to vectorize for vectorizing like generalized ufuncs (#8054)

The bodies for each commit get combined into the body of the squashed commit.

I like it because figuring out how to rebase to cleanup git history can be a burden for new committers.

@charris
Copy link
Member

charris commented Oct 18, 2016

Hmm, so the commit message comes from the PR title rather than the first commit? What about the commit message body, was that an option to add?

@shoyer
Copy link
Member Author

shoyer commented Oct 18, 2016

@charris You should try clicking the "Squash and merge" button once (I just re-enabled it). It doesn't merge until you click a second time to confirm, and it will show you an editable preview of the commit message.

@charris
Copy link
Member

charris commented Oct 18, 2016

I think there is documentation saying that issue numbers should be part to the commit message summary line. We should change that if we allow squashing so we can reliably pull out the PR number as opposed to issues. See the tools/announce.py script to see how that done now by searching merges only. I'd still like to document the committer as well as the contributor. If we could disable the fast-forwarding I'd be happy. I think...

@charris
Copy link
Member

charris commented Oct 18, 2016

A quick grep on the commit log exposes what looks like three uses of the squash button. One is mine, I suppose the other two are yours.

@charris
Copy link
Member

charris commented Oct 18, 2016

Another solution would be to not bother squashing. My thinking about this is colored by trying to generate useful logs automatically as well as tracking committers. But I agree that the squashed commit message has much to recommend it and loss of merge tracking might not be so bad.

@charris charris merged commit 0a02bb6 into numpy:master Oct 18, 2016
@charris
Copy link
Member

charris commented Oct 18, 2016

Giving it a shot. I think the button needs to be used with discrimination.

@charris
Copy link
Member

charris commented Oct 18, 2016

OK, one thing that is off is that the summary line of the commit, taken from the PR title, is too long at 88 characters. So one thing to do when using this option is to edit the title line to fix it up. Note that the messages from the consolidated commits also get indented an extra four spaces. All in all, I think manually squashing is a better option at this point but using the squash button is a possiblity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vectorize fails for zero-dimensioned arrays
4 participants