Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement logcdf method for discrete distributions #4387

Merged
merged 9 commits into from
Dec 31, 2020

Conversation

ricardoV94
Copy link
Member

@ricardoV94 ricardoV94 commented Dec 27, 2020

This PR adds logcdf methods to all univariate discrete distributions. Formulas were based in either scipy or wikipedia entries. For some distributions, I could not find specialized formulas and had to rely on summing up all the logps in tt.arange(0, value+1).

Closes #4331

Unittests:

  1. Using check_logcdf in all distributions with scipy counterpart.
  2. Added a new type of unit test check_selfconsistency_discrete_logcdf, which tests that the logcdf is equivalent to adding all the logps starting from zero domain.lower. This provides coverage to the distributions that are not in scipy (not covered by 1.). I added it to all distributions, but maybe this is too redundant for those that are covered by 1.

Some remaining issues / questions:

  1. Should all bound calls in CDF methdos be replaced by tt.switch? Most logcdf methods for continous distributions seem to use tt.switch, except for the Gamma and InverseGamma distributions. I used bound whenever there were at least two conditions that needed checking leading no -np.inf, otherwise went with tt.switch. I am also not completely sure if this interferes with the recent Add flag to disable bounds check for speed-up #4377 PR.
  2. Several logcdf methods do not work with multiple values (or even with an array containing a single value such as np.array([1])). In these cases, I excluded information in the docstrings about the possibility of computing the logcdf for multiple values, but this seems like a suboptimal solution. We should either fix it or raise a ValueError to explain the limitation to the user, as this seems to go against the expected behavior in most distributions. This occurs for two reasons:
    1. Computing logcdf by summing over the individual logps with logsumexp(tt.arange(0, value+1), keepdims=False) for the BetaBinomial and HyperGeometric. Any alternatives?
    2. Using the incomplete_beta function in the Binomial and NegativeBinomial (and ZeroInflated counterparts). This is also an issue in the current implementation of the Beta and StudentT distributions (see Beta logcdf method fails with array of values #4342). In addition, the unit tests for these functions seem to be very slow compared to those of other distributions. Is the incomplete_beta particularly slow?
  3. Slightly different, the logcdf method of the Poisson distribution (and its ZeroInflated counterpart) fails with a C-assertion (exiting the python process altogether) when asked to evaluate multiple invalid values. This is also a problem in the InverseGamma distribution (see InverseGamma logcdf method fails with invalid parameters when array is used #4340). It will be solved once this Theano issue is fixed (see Return NaN in C implementations of SciPy Ops aesara-devs/aesara#224). Update: I have found a temporary hack to "hide" this problem (which can be removed once those other issues are solved).
  4. Unittests: check_logcdf tests only for values in the output domain, ignoring the edge values, but we might want to test also for values at the edges as well as below or beyond the domain (which can be evaluated to either -np.inf or 0). For example pm.Bernoulli(p=.1).logcdf([-1, 2]) -> [-np.inf, 0]. Should we use more comprehensive extra-domain checks? Update: This is now implemented in Increase unittest check_logcdf coverage and fix issues with some distribution methods #4393.

Other changes not directly related to this PR:

  1. Added a missing pymc3_matches_scipy test to the BetaBinomial. Was there a reason why this was missing?
  2. Changed the order of the logp and random methods in the DiscreteWeibull distribution to be in line with the rest of the library.
  3. Removed unused local variables in the init methods of DiscreteWeibull and ZeroInflatedPoisson distributions.

@codecov
Copy link

codecov bot commented Dec 27, 2020

Codecov Report

Merging #4387 (dc4ce4a) into master (3cfee77) will increase coverage by 0.05%.
The diff coverage is 91.54%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4387      +/-   ##
==========================================
+ Coverage   88.04%   88.09%   +0.05%     
==========================================
  Files          88       88              
  Lines       14482    14538      +56     
==========================================
+ Hits        12750    12807      +57     
+ Misses       1732     1731       -1     
Impacted Files Coverage Δ
pymc3/distributions/discrete.py 95.00% <91.54%> (-0.85%) ⬇️
pymc3/sampling_jax.py 0.00% <0.00%> (ø)
pymc3/distributions/multivariate.py 83.64% <0.00%> (+0.72%) ⬆️

@ricardoV94
Copy link
Member Author

I am sorry for all the changes after opening the PR. I think it is now ready for review.

Copy link
Contributor

@AlexAndorra AlexAndorra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks @ricardoV94 !
This is a pretty big PR, so a second review by another core dev would be beneficial.

I agree that the logcdf methods that do not work with multiple values should raise a ValueError to explain the limitation to the user. Related to this, I added some comments to add the mutiple values' types (numpy array or theano tensor) to the docstrings

at the specified value.
Parameters
----------
value: numeric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably add the types for multiple values then. Something like the following?

Suggested change
value: numeric
value: numeric or np.ndarray or theano.tensor


Parameters
----------
value: numeric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
value: numeric
value: numeric or np.ndarray or theano.tensor

at the specified value.
Parameters
----------
value: numeric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
value: numeric
value: numeric or np.ndarray or theano.tensor

at the specified value.
Parameters
----------
value: numeric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
value: numeric
value: numeric or np.ndarray or theano.tensor

at the specified value.
Parameters
----------
value: numeric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
value: numeric
value: numeric or np.ndarray or theano.tensor

at the specified value.
Parameters
----------
value: numeric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
value: numeric
value: numeric or np.ndarray or theano.tensor

at the specified value.
Parameters
----------
value: numeric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
value: numeric
value: numeric or np.ndarray or theano.tensor

@ricardoV94
Copy link
Member Author

Thanks for the review @AlexAndorra. I agree about the ValueError, although I have to check how to check that with Theano variables.

I was hoping someone would suggest a fix for those that use tt.arange. Scipy actually does a loop to return results for multiple values.

About the type hints in the docstring. What if we do a separate PR just for that, so that all logcdfs (discrete and continuous) are homogeneous? Right now, none of the other docstrings have it and they should as well.

Copy link
Member

@twiecki twiecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of the doc-strings miss new-lines before Parameters and Returns, otherwise this looks great to me!

@ricardoV94
Copy link
Member Author

A lot of the doc-strings miss new-lines before Parameters and Returns, otherwise this looks great to me!

Thanks for reviewing it. I will fix those.

Unrelated to that, does anyone know a good way to check that value is not a scalar in order to raise an informative ValueError in those methods that fail with arrays/tensors?

@twiecki
Copy link
Member

twiecki commented Dec 30, 2020

You can try whether np.isscalar works on numpy and theano arrays.

@ricardoV94
Copy link
Member Author

ricardoV94 commented Dec 31, 2020

New commit fixes missing newlines in docstrings and raises informative TypeError in logcdf methods that only accept scalars.

I would address @AlexAndorra suggestion of including more informative docstring type hints, as well as expanding check_logcdf unittest to make sure that logcdf methods that fail with nonscalar values raise a TypeError in #4393. The reason for this is that the logcdf methods of some continuous distributions also need to be changed and I think it would be easier to keep track / review those in that separate PR. Do you agree?

…` making use of `tt.log1p` and `logaddexp`.

More informative comment on workaround for `Poisson.logcdf`.
@twiecki twiecki merged commit d871b80 into pymc-devs:master Dec 31, 2020
@twiecki
Copy link
Member

twiecki commented Dec 31, 2020

This is a great contribution, thanks @ricardoV94!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Suggestion: CDF methods for discrete variables
3 participants