Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement flexible shape/dims/size API #4625

Merged
merged 3 commits into from
Apr 20, 2021
Merged

Conversation

michaelosthege
Copy link
Member

@michaelosthege michaelosthege commented Apr 8, 2021

Terminology

  • implied dimensions are those that a RV takes without specifying any of shape,dims,size,observed.
  • The size of an RV are all dimensions in addition to (preceeding) the implied dimensions.

Key Takeaways

  • shape can be numeric/symbolic and supports Ellipsis in the last position. Without Ellipsis a SpecifyShape Op is added automatically.
  • dims can be None and are defined at first mention. Ellipsis is also supported. In combination with pm.Data the dimensions specified through dims are re-sizeable.
  • size does care about implied dimensions in any way and just adds new dimensions to the RV. Internally size is inferred from shape and dims.
  • No more resizing in Model.make_obs_var because it leaves code in Distribution.__new__ not knowing the output shape.

ToDo

  • Distribution.dist() and Distribution.__new__ are similar in use (dims and observed only through __new__)
  • Inputs are checked and informative errors are raised where possible
  • Release notes were updated
  • Some tests were edited to use size/shape appropriately -- some tests illegally used size to specify size and implied dimensions.

ToDo (optional)

  • Ideally, it should be possible to pass dims in combination with size. We should just require a single source of truth (dim names must be undefined for dimensions that size integers apply to).

pymc3/distributions/distribution.py Outdated Show resolved Hide resolved
pymc3/distributions/distribution.py Outdated Show resolved Hide resolved
Copy link
Contributor

@brandonwillard brandonwillard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduces a considerable amount of complexity into Distribution.__new__; the shape logic really needs its own function/method, at the very least.

Also, this PR unnecessarily reinstates the use of shape internally (e.g. numerous change from size to shape in the tests), which is the opposite of what we set out to do.

pymc3/distributions/distribution.py Outdated Show resolved Hide resolved
pymc3/distributions/distribution.py Outdated Show resolved Hide resolved
pymc3/distributions/distribution.py Outdated Show resolved Hide resolved
pymc3/tests/test_distributions_random.py Show resolved Hide resolved
@michaelosthege
Copy link
Member Author

Caution: The API from aesara.tensor.random.basic Ops such as normal resembles the NumPy API:

x = aesara.shared(np.array([10,20,30,40]))
n = atr.normal(loc=x, scale=0.1, size=(2,4))     # size=(2,) does not work!

In PyMC3 this will equivalent to "inflexible" use of shape:

with pm.Model():
    x = aesara.shared(np.array([10,20,30,40]))
    pm.Normal("n1", mu=x, sd=0.1, shape=(2,4))
    pm.Normal("n2", mu=x, sd=0.1, size=(2,))

@michaelosthege
Copy link
Member Author

michaelosthege commented Apr 11, 2021

With the latest push the Distribution() and Distribution.dist() APIs are now aligned and validation/warning/casting logic moved to a helper function.

dims are not available through the .dist() API until dimension name broadcasting is implemented at the Aesara Variable level.

Passing a non-Ellipsis shape like shape=(1, 2, 3, 4) is the backwards-compatible use of the API that, at the same time results in the automatic addition of specify_shape.
The SpecifyShape Op couldn't deal with scalar shapes (), so the corresponding tests will fail until the fixes from pymc-devs/pytensor#367 are available.

A BestPracticeWarning is emitted in the following scenarios:

  • pm.Normal(..., shape=3) → "The shape parameter should be a list or tuple."
  • pm.Normal(..., dims="country") → "The dims parameter should be a list or tuple."

No BestPracticeWarnings are emitted in the following scenarios:

  • pm.Normal(..., mu=[1,2], shape=(3, 2)) → Automatic specify_shape will alert the user when the implied shape changes.
  • pm.Normal(..., mu=[1,2], shape=(3, ...)) → Notation indicates that there are more than 1 dims (debatable if we should nudge towards size).

Before we invest too much time into shape/size arguments, I would like to point out that dims are 10x more useful, because they make working with the resulting InferenceData much easier.

@OriolAbril
Copy link
Member

Before we invest too much time into shape/size arguments, I would like to point out that dims are 10x more useful, because they make working with the resulting InferenceData much easier.

💯

pymc3/distributions/distribution.py Outdated Show resolved Hide resolved
@brandonwillard
Copy link
Contributor

brandonwillard commented Apr 11, 2021

dims are not available through the .dist() API until dimension name broadcasting is implemented at the Aesara Variable level.

To be clear, this is not a change that we should make in Aesara. Dimension names are completely unrelated to the concerns of Aesara and we absolutely must keep the objects in that library as simple and efficient as possible.

That does not mean that we cannot consider creating a custom TensorVariable or TensorType class for this purpose in PyMC3; however, I would still hesitate to do that unless it's absolutely necessary.

For instance, why can't we just use Variable.tag to hold these dimension labels and reason with/about them in PyMC3?

@brandonwillard
Copy link
Contributor

A BestPracticeWarning is emitted in the following scenarios:

  • pm.Normal(..., shape=3) → "The shape parameter should be a list or tuple."
  • pm.Normal(..., dims="country") → "The dims parameter should be a list or tuple."

These are superfluous and oddly strict. NumPy doesn't even complain about this; why should PyMC?

@twiecki
Copy link
Member

twiecki commented Apr 11, 2021

A BestPracticeWarning is emitted in the following scenarios:

  • pm.Normal(..., shape=3) → "The shape parameter should be a list or tuple."
  • pm.Normal(..., dims="country") → "The dims parameter should be a list or tuple."

These are superfluous and oddly strict. NumPy doesn't even complain about this; why should PyMC?

I agree, I don't think a warning is required here.

@michaelosthege
Copy link
Member Author

I agree, I don't think a warning is required here.

Okay, this leaves us with no BestPracticeWarning at all.

dims=("town", ...) now works too. The ...-implied dims are set to he default pattern of f"{name}_dim_{d}" where d counting includes the size dimensions, like it would if there weren't any dim names at all.
For those implied dim names, None is set in the Model.RV_dims[rv_name] tuple of coordinate values. @OriolAbril is that something we can work with downstream, or would you prefer a different appraoch?

pymc3/distributions/distribution.py Outdated Show resolved Hide resolved
pymc3/distributions/distribution.py Outdated Show resolved Hide resolved
RELEASE-NOTES.md Outdated Show resolved Hide resolved
OriolAbril
OriolAbril previously approved these changes Apr 18, 2021
Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These improvements on dims are amazing and I personally think most user code will soon use mostly dims instead of either of shape or size.

RELEASE-NOTES.md Outdated Show resolved Hide resolved
pymc3/distributions/distribution.py Outdated Show resolved Hide resolved
pymc3/distributions/distribution.py Outdated Show resolved Hide resolved
pymc3/model.py Outdated Show resolved Hide resolved
pymc3/model.py Outdated Show resolved Hide resolved
pymc3/model.py Outdated Show resolved Hide resolved
Also refactor into properties to add docstrings and type annotations.
And no longer allow InferenceData conversion without a Model on stack.

Co-authored-by: Oriol Abril Pla <oriol.abril.pla@gmail.com>
Copy link
Contributor

@brandonwillard brandonwillard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of unnecessary shape keyword changes in unrelated tests that need to be reverted. As well, there seem to be some similarly unrelated non-shape test changes that need to be reverted.

Otherwise, the logic in Distribution.[__new__|dist] is still more complicated than it needs to be and should be better organized and distributed between the Model and Distribution classes.

Under these changes, Distribution is taking on steps that are—and were—performed in Model, and this is likely the cause of the current failures. Without an unusually compelling reason for this previously Model-specific logic to be relocated to Distribution, these changes are not sound.

For instance, the steps in Distribution that are performed based on the value of observed are actually Model-exclusive, because observations are a Model-only concept and feature. By moving—or adding—parts of that logic into Distribution, we fragment our logic and violate the abstractions that underlie these classes. The end result is code that's harder to follow—because the underlying abstractions are much less relevant to the actual implementation—and exceedingly more difficult to maintain and extend, due to a less manageable back-and-forth between the two classes.

In the end, these changes should only require a single call in Distribution.__new__ to a function that—at maximum—takes the size, shape, dims, rv_op, and distribution parameters as arguments, determines the appropriate size to be passed to Distribution.dist, and that's all.

This function could be a method provided by Model, which would make it reasonable to also perform any relevant Model updates that involve dims within it.

The only step that seems like it could/should be in Distribution is the application of specify_shape when shape is used, but I'm not seeing that in this diff.

RELEASE-NOTES.md Outdated Show resolved Hide resolved
pymc3/distributions/distribution.py Outdated Show resolved Hide resolved
if dims is not None:
if Ellipsis in dims:
# Auto-complete the dims tuple to the full length
dims = (*dims[:-1], *[None] * rv_out.ndim)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the RandomVariable created first, and then resized, only because rv_out.ndim might be needed in this step?

Regardless, rv_out.ndim is obtainable without first creating the underlying RandomVariable.

Comment on lines 266 to 289
n_resize = len(dims) - n_implied

if "shape" in kwargs:
raise DeprecationWarning("The `shape` keyword is deprecated; use `size`.")
# All resize dims must be known already (numerically or symbolically).
unknown_resize_dims = set(dims[:n_resize]) - set(model.dim_lengths)
if unknown_resize_dims:
raise KeyError(
f"Dimensions {unknown_resize_dims} are unknown to the model and cannot be used to specify a `size`."
)

transform = kwargs.pop("transform", UNSET)
# The numeric/symbolic resize tuple can be created using model.RV_dim_lengths
resize = tuple(model.dim_lengths[dname] for dname in dims[:n_resize])
elif observed is not None:
if not hasattr(observed, "shape"):
observed = pandas_to_array(observed)
n_resize = observed.ndim - n_implied
resize = tuple(observed.shape[d] for d in range(n_resize))

if resize:
# A batch size was specified through `dims`, or implied by `observed`.
rv_out = change_rv_size(rv_var=rv_out, new_size=resize, expand=True)

if dims is not None:
# Now that we have a handle on the output RV, we can register named implied dimensions that
# were not yet known to the model, such that they can be used for size further downstream.
for di, dname in enumerate(dims[n_resize:]):
if not dname in model.dim_lengths:
model.add_coord(dname, values=None, length=rv_out.shape[n_resize + di])
Copy link
Contributor

@brandonwillard brandonwillard Apr 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this logic needs to be factored out into a simple and clear function that—for instance—computes an appropriate value of size (to be passed to Distribution.dist) using values from dims or shape.

This is especially important given that this code also updates the model's coordinates, making it easily more relevant as a method in Model.

In other words, it's best to keep the Model-specific logic in Model, and not spread out across Distribution and Model.

Also, the steps related to observed are apparently duplicating some of the steps taken later in Model.make_obs_var. This might be yet another good reason for moving this logic into Model.

Comment on lines +337 to +336
# Create the RV without specifying size or testval.
# The size will be expanded later (if necessary) and only then the testval fits.
rv_native = cls.rv_op(*dist_params, size=None, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This demonstrates a way to obtain the symbolic shape for a RandomVariable without actually creating one.

pymc3/tests/test_shape_handling.py Outdated Show resolved Hide resolved
pymc3/tests/test_shape_handling.py Show resolved Hide resolved
pymc3/tests/test_shape_handling.py Outdated Show resolved Hide resolved
@@ -982,7 +982,7 @@ def test_linalg(self, caplog):
a = Normal("a", size=2, testval=floatX(np.zeros(2)))
a = at.switch(a > 0, np.inf, a)
b = at.slinalg.solve(floatX(np.eye(2)), a)
Normal("c", mu=b, size=2, testval=floatX(np.r_[0.0, 0.0]))
Normal("c", mu=b, shape=(2,), testval=floatX(np.r_[0.0, 0.0]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Normal("c", mu=b, shape=(2,), testval=floatX(np.r_[0.0, 0.0]))
Normal("c", mu=b, size=2, testval=floatX(np.r_[0.0, 0.0]))

Copy link
Member Author

@michaelosthege michaelosthege Apr 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as with other examples: The way size was used before the testval shape does not match. Like with other examples, this was likely an unintended change of behavior from the original test when it was switched to size.

Comment on lines +277 to +281
def build_model(self, distfam, params, *, size=None, shape=None, transform=None, testval=None):
if testval is not None:
testval = pm.floatX(testval)
with pm.Model() as m:
distfam("x", size=size, transform=transform, testval=testval, **params)
distfam("x", size=size, shape=shape, transform=transform, testval=testval, **params)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the changes in this module need to be reverted; they have nothing to do with the shape or dims feature, nor should they.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the changes in this module are because many of these tests are actually parametrized by shape.
Consider this test case from before the change:

    @pytest.mark.parametrize(
        "a,b,size",
        [
            (1.0, 1.0, (2,)),                               # scalar is implied; 2 replicate dimensions
            (np.ones(3), np.ones(3), (4, 3)),               # (3,) is implied; (4, 3) replicate dimensions
        ],
    )
    def test_beta_ordered(self, a, b, size):
        testval = np.sort(np.abs(np.random.rand(*size)))    # but the testval is at most 2-dimensional
        model = self.build_model(
            pm.Beta,
            {"alpha": a, "beta": b},
            size=size,                     # in the second parametrize case the RV is expanded to (4, 3, 3)
            testval=testval,               # but the testval is only (4, 3)
            transform=tr.Chain([tr.logodds, tr.ordered]),
        )
        self.check_vectortransform_elementwise_logp(model, vect_opt=0)

Comparing with master we can see that the test was up to 2-dimensional to begin with. The change to size had changed the behaviour of the test - I'm just reverting it to what was originally intended & tested.
https://github.com/pymc-devs/pymc3/blob/2dee9dafb3541043ae23b66fe9262e70fbd2769a/pymc3/tests/test_transforms.py#L405-L421

Yes, the test could be refactored to use size, by either replicating the shape←→size logic, or parametrizing the test with both shape and size manually. That would be a lot of unnecessary work and make the tests harder to work with!
Since this PR is about restoring shape as a backwards-compatible and fully supported way to parametrize RVs I do not see the point in making our lives harder by notoriously avoiding it in tests like this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the testvals for this test in v3 and v4 and they're both the same. The symbolic types for the resulting variables are also the same between v3 and v4, so I don't understand what's changed here—aside from the switch to size.

More importantly, this is a scalar variable, so size and shape should be the same. If they're not, then that's an altogether different issue that needs to be addressed separately.

Regardless, the change here is still an unnecessary reversion that forces these tests to use an irrelevant and newly added feature.

Always treats `size` as being in addition to dimensions implied by RV parameters.

All resizing beyond parameter-implied dimensionality is done from:
- `shape` or `size` in `Distribution.dist()`
- `dims` or `observed` in `Distribution.__new__`
and only in those two places.

Closes pymc-devs#4552.
@michaelosthege
Copy link
Member Author

The test about #4652 XPASSed: https://github.com/pymc-devs/pymc3/pull/4625/checks?check_run_id=2394141812#step:7:3572

It shouldn't have done that - looks like https://github.com/pymc-devs/pymc3/blob/v4/pyproject.toml#L2 isn't working.

With respect to the other changes...

Reasons for why some tests were changed are thoroughly commented above (see threads).

Regarding other options to obtain implied dimensionality information, I'd like to quote @brandonwillard:

it's generally cheap to make these graphs, so you can always construct one just for information
also, graph construction isn't a "run-time" thing, so it doesn't really affect performance

So that's how I implemented it. There might be other ways, but given that it's cheap, that it works and that it restores the important backwards compatibility while bringing more flexibility I don't think we should delay this PR any further.

I think the shape/dims/size/size_from_observed logic from the two Distribution APIs should not be split across distribution.py/model.py. Not only do they reference the same helper functions, but the Distribution class is also the point of entry. After moving the determination of size from observed also into Distribution.__new__ the model.py is always given RVs with the correct dimensionality already and doesn't even import change_rv_size.
Factoring out the logic within Distribution.__new__ can possibly be addressed in a follow-up PR that makes dims work alongside shape/size (see PR description).

So from my point of view this PR is all good cc @brandonwillard @OriolAbril @twiecki

Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly agree with @michaelosthege last comment, but my opinion on model/distribution topics probably should not matter much as I only have a loose understanding their internals.

I do want to add an extra point on that however, as a relative sideliner. I feel that without a clear/detailed roadmap (that I know of) and with things like pymc-devs/pytensor#352 on the air, I am not sure it's possible at this point to know what is the correct level of abstraction for each thing. There seem to be arguments to move dims in either direction, towards dist and a better integration with Aesara or towards Model.

And last but definitely not least, I want to say that the more I see about the capabilities that v4 is improving/adding/extending the more amazed I am with it. Thanks to everyone that is pushing hard on this and leading the effort.

@twiecki twiecki marked this pull request as ready for review April 20, 2021 21:44
@twiecki twiecki merged commit c99f15c into pymc-devs:v4 Apr 20, 2021
@twiecki
Copy link
Member

twiecki commented Apr 20, 2021

I agree, I think this looks great. Agree with @OriolAbril that at this point it's not clear what the right level of abstraction is and I don't see a clear winner. We can always revisit when we find out later that we can improve things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants